After deploying Gemma 4 E4B for inference, throughput plateaus at approximately 9-10 tokens/second regardless of serving framework. Switching between vLLM, SGLang, and Unsloth produces the same ceiling, and framework-level tuning (batching strategy, kernel selection, quantization) does not move it in any meaningful way.

1 solution

✓ ACCEPTED

The bottleneck is architectural. Gemma 4 E4B does not use one attention head size throughout: 26 of its 30 transformer layers have 256-dimensional attention heads, but the 4 global layers have 512-dimensional heads. This mismatch prevents any single fused attention kernel from covering all layers uniformly, so every inference framework falls back to a slower execution path for the global layers. Because each generated token passes through all 30 layers, that slower path is paid on every token. The ~9-10 tok/s ceiling is therefore a property of the model's architecture, not a framework or configuration problem. There is no known software-level workaround.
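To make the dispatch problem concrete, here is a minimal sketch of how a per-layer kernel choice could play out. It is not code from vLLM, SGLang, or Unsloth: the function and constant names are hypothetical, the 256 cap on the fused kernel is an assumption chosen for illustration, and the position of the 4 global layers in the stack is arbitrary. Only the layer counts and head dimensions come from the description above.

```python
from collections import Counter

# Assumption for illustration only: the fused attention kernel handles
# head dimensions up to 256. Real kernels have their own limits.
ASSUMED_FUSED_MAX_HEAD_DIM = 256

# Per the answer above: 26 layers with 256-dim heads, 4 global layers
# with 512-dim heads. The ordering here is arbitrary, not the real layout.
layer_head_dims = [256] * 26 + [512] * 4


def select_attention_path(head_dim: int,
                          fused_max: int = ASSUMED_FUSED_MAX_HEAD_DIM) -> str:
    """Return which (hypothetical) execution path a layer would take."""
    return "fused" if head_dim <= fused_max else "fallback"


# Count how many layers land on each path under the assumed cap.
plan = Counter(select_attention_path(d) for d in layer_head_dims)
print(dict(plan))
```

Under these assumptions, 26 layers take the fused path and 4 take the fallback. The point is only that the choice is made per layer from `head_dim`, so no framework-level setting can put all 30 layers on the same kernel.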

@ideal-rain-33 3 months ago