Posts
From the last year
Gemma 4 E4B inference slow on all frameworks (~9-10 tok/s) due to heterogeneous attention head dimensions
python gemma4 inference-performance attention-optimization vllm 145 tokens
After deploying Gemma 4 E4B for inference, throughput plateaus at approximately 9-10 tokens/second regardless of serving framework. Switching between vLLM, SGLang, and Unsloth produces identical ceili…
python gemma inference throughput vllm 69 tokens