When serving Gemma 4 E4B (4.5B effective parameters) on a Modal L40S with Unsloth, generation throughput is only ~9-10 tokens/second, despite 48 GB of VRAM and a low batch size. vLLM and SGLang show the same throughput. For a model of roughly 4B parameters, 30-50 tok/s would be expected. The model loads and forward passes complete; only generation speed is affected.
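To put the gap in context, here is a rough sketch of a memory-bandwidth ceiling for single-stream decoding, plus a framework-agnostic timing helper. The assumptions are mine, not measurements: 864 GB/s memory bandwidth for the L40S (the figure on NVIDIA's spec sheet), bf16 weights at 2 bytes per parameter, and every one of the 4.5B effective parameters read once per generated token. KV-cache traffic and any additional embedding parameters are ignored, so the real ceiling is lower. The `generate` callable is a hypothetical stand-in for whichever framework call is being timed.

```python
import time


def decode_ceiling_tok_per_s(n_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Upper bound on batch-1 decode speed if each token reads all weights once."""
    bytes_per_token = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / bytes_per_token


def measure_tok_per_s(generate, n_tokens):
    """Time a generation call and return tokens per second.

    `generate` is a hypothetical callable: it takes the number of tokens to
    produce and returns how many it actually produced.
    """
    start = time.perf_counter()
    produced = generate(n_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed


if __name__ == "__main__":
    # Assumed values, see the note above; adjust for the actual deployment.
    ceiling = decode_ceiling_tok_per_s(
        n_params=4.5e9,                    # effective parameters, from the report
        bytes_per_param=2,                 # bf16 weights (assumption)
        mem_bandwidth_bytes_per_s=864e9,   # L40S spec-sheet bandwidth (assumption)
    )
    observed = 9.5  # midpoint of the ~9-10 tok/s reported above
    print(f"ceiling ~{ceiling:.0f} tok/s, observed/ceiling = {observed / ceiling:.2f}")
```

Under these assumptions the ceiling comes out near 96 tok/s, so the observed 9-10 tok/s would be about a tenth of what memory bandwidth alone allows. That is consistent with the bottleneck being somewhere other than raw weight streaming, though it does not identify where.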
This appears to be an architectural issue: Gemma 4's heterogeneous attention head dimensions seem to prevent standard CUDA kernel optimizations from reaching the expected throughput. Because all major inference frameworks tested are affected equally, this is probably not a framework-specific bug but a constraint of the model architecture itself.
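One way to check the head-dimension hypothesis is to group layers by their attention head dimension and see whether more than one value appears. The sketch below is only an illustration: the helper names are hypothetical, and the example values are placeholders, not Gemma 4's real configuration. In practice the list would be filled from the loaded model's per-layer attention settings, however the framework in use exposes them.

```python
from collections import defaultdict


def group_layers_by_head_dim(layer_head_dims):
    """Map each distinct head dimension to the layer indices that use it."""
    groups = defaultdict(list)
    for layer_idx, head_dim in enumerate(layer_head_dims):
        groups[head_dim].append(layer_idx)
    return dict(groups)


def is_heterogeneous(layer_head_dims):
    """True if the layers do not all share a single head dimension."""
    return len(set(layer_head_dims)) > 1


if __name__ == "__main__":
    # Placeholder values for illustration only, NOT taken from Gemma 4.
    example_dims = [256, 256, 512, 256]
    print(group_layers_by_head_dim(example_dims))
    print("heterogeneous:", is_heterogeneous(example_dims))
```

If the grouping shows more than one head dimension, a next step would be to profile each group separately and see whether one of them falls back to a slower attention path.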