Three non-obvious architectural surprises when fine-tuning and serving Gemma 4

Across a multi-session effort to fine-tune Gemma 4 (4B and 31B variants) for text generation using Unsloth LoRA and HuggingFace DPO on Modal, three architectural properties of Gemma 4 surfaced that standard getting-started guides do not mention, and each cost significant debugging time. Early sessions planned to exploit Gemma 4 E4B's thinking mode and its reported inference efficiency; later sessions hit each of the following blockers in practice.

1. DPOTrainer crashes on Gemma 4 even with text-only data. [[gtp_01kq585j04e6dbv81scrkh4rvc]] Gemma 4's forward pass is multimodal by design: it requires mm_token_type_ids even when the batch contains only text. DPOTrainer builds text-only batches and does not supply that field, so the first forward call fails with ValueError: mm_token_type_ids is required. The fix is to monkey-patch model.forward so that it injects a default all-zeros tensor whenever the field is absent.
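A minimal sketch of the monkey-patch described in item 1. The helper name patch_forward_with_default_mm_token_type_ids is hypothetical, and the sketch assumes that an all-zeros tensor with the same shape as input_ids is an acceptable "text-only" default, as the note states. It uses the standard PyTorch method Tensor.new_zeros so the injected tensor inherits the dtype and device of input_ids.

```python
import functools


def patch_forward_with_default_mm_token_type_ids(model):
    """Wrap model.forward so a zeros mm_token_type_ids is supplied when absent.

    Assumption: zeros shaped like input_ids mark every token as plain text.
    """
    original_forward = model.forward

    # functools.wraps keeps the original signature visible to tools that
    # inspect model.forward.
    @functools.wraps(original_forward)
    def forward(*args, **kwargs):
        if kwargs.get("mm_token_type_ids") is None:
            # input_ids may arrive as a keyword or as the first positional arg.
            input_ids = kwargs.get("input_ids")
            if input_ids is None and args:
                input_ids = args[0]
            if input_ids is not None:
                kwargs["mm_token_type_ids"] = input_ids.new_zeros(input_ids.shape)
        return original_forward(*args, **kwargs)

    model.forward = forward
    return model
```

Under these assumptions the patch would be applied once, before the trainer is constructed, for example model = patch_forward_with_default_mm_token_type_ids(model). If a separate reference model is passed to the trainer, it would presumably need the same patch.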

2. Inference throughput is architecturally capped at ~9-10 tok/s, and swapping frameworks does not help. [[gtp_01kq58jm95fnzrdpn4hc4dsr3p]] Gemma 4 E4B uses attention heads of two different sizes: 26 of its 30 layers run at 256-dim, while the 4 global layers run at 512-dim. No single fused attention kernel covers all layers, so vLLM, SGLang, and Unsloth all hit the same ceiling. This was discovered only after testing multiple frameworks under the assumption that the bottleneck was at the software level.
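When comparing frameworks as in item 2, it helps to measure every backend with the same harness. The sketch below is framework-agnostic: generate_fn is a hypothetical callable you would write per backend (vLLM, SGLang, Unsloth) that runs one generation and returns the number of tokens it produced. The figure includes prefill time, so it is an end-to-end rate rather than a pure decode rate.

```python
import time
from typing import Callable


def measure_tok_per_s(
    generate_fn: Callable[[str, int], int],
    prompt: str,
    max_new_tokens: int = 256,
    warmup: int = 1,
    runs: int = 3,
) -> float:
    """Return generated tokens per second averaged over `runs` calls.

    generate_fn(prompt, max_new_tokens) is a hypothetical per-backend wrapper
    that must return the count of tokens actually generated.
    """
    # Warm-up calls absorb one-time costs such as compilation or cache setup.
    for _ in range(warmup):
        generate_fn(prompt, max_new_tokens)

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```

If every backend lands in the same narrow band under this harness, that is a hint the limit sits below the framework layer.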

3. Thinking mode silently consumes the entire token budget, returning only a separator marker. [[gtp_01kq58yq3efv7r2d462w6wbj74]] With enable_thinking=True and max_tokens ≤ 1024, the chain-of-thought reasoning exhausts the generation budget before the answer is written. The <channel|> delimiter appears at the end of the output with no answer following it. Raising max_tokens to 2048 or more resolves this.
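The failure in item 3 is easy to miss because the call itself succeeds. A small guard can detect the empty-answer case and retry with a larger budget. This is a sketch under stated assumptions: generate_fn is a hypothetical wrapper around whatever client is in use, the delimiter string is copied from the note, and output with no delimiter at all is also treated as "no answer" (an assumption that only makes sense when thinking mode is on).

```python
from typing import Callable, Optional, Sequence

THINKING_DELIMITER = "<channel|>"  # delimiter string as quoted in the note


def extract_answer(text: str, delimiter: str = THINKING_DELIMITER) -> Optional[str]:
    """Return the text after the last delimiter, or None if nothing follows it."""
    _, sep, tail = text.rpartition(delimiter)
    if not sep:
        # Delimiter never appeared: assume reasoning was cut off mid-thought.
        return None
    answer = tail.strip()
    return answer or None


def generate_with_budget_retry(
    generate_fn: Callable[[str, int], str],
    prompt: str,
    budgets: Sequence[int] = (1024, 2048, 4096),
) -> Optional[str]:
    """Try increasing max_tokens budgets until an answer follows the delimiter.

    generate_fn(prompt, max_tokens) is a hypothetical wrapper returning raw text.
    """
    for max_tokens in budgets:
        answer = extract_answer(generate_fn(prompt, max_tokens))
        if answer is not None:
            return answer
    return None
```

Starting directly at 2048 avoids the wasted first call; the escalating sequence is only there to illustrate the detection logic.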