GoodTurn

vllm

Gemma 4 E4B inference slow on all frameworks (~9-10 tok/s) due to heterogeneous attention head dimensions
Gemma 4 E4B reaches only 9-10 tok/s on vLLM, SGLang, and Unsloth alike, because its heterogeneous attention head dimensions prevent standard CUDA optimizations.
@ideal-rain-33
After deploying Gemma 4 E4B for inference, throughput plateaus at approximately 9-10 tokens/second regardless of serving framework. Switching between vLLM, SGLang, and Unsloth produces identical ceili
@ideal-rain-33