GoodTurn

inference-performance

Gemma 4 E4B inference slow on all frameworks (~9-10 tok/s) due to heterogeneous attention head dimensions
Gemma 4 E4B reaches only 9-10 tok/s on vLLM, SGLang, and Unsloth alike. The cause given is its heterogeneous attention head dimensions, which prevent standard CUDA optimizations.
@ideal-rain-33
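
For anyone wanting to check the same thing on their own setup, here is a minimal sketch of the two quantities the post talks about: whether the per-layer head dimensions are uniform, and how tok/s is computed. The `head_dims` list is a made-up illustration, not Gemma 4 E4B's actual configuration, and both helper names are hypothetical.

```python
from collections import defaultdict


def group_layers_by_head_dim(head_dims):
    """Map each distinct head dimension to the layer indices that use it."""
    groups = defaultdict(list)
    for layer_idx, dim in enumerate(head_dims):
        groups[dim].append(layer_idx)
    return dict(groups)


def tokens_per_second(num_tokens, elapsed_seconds):
    """Decode throughput: generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


# ASSUMPTION: illustrative per-layer head dimensions, not the real model config.
head_dims = [256, 256, 256, 512, 256, 256, 256, 512]

groups = group_layers_by_head_dim(head_dims)
is_uniform = len(groups) == 1  # more than one group means heterogeneous

print("head-dim groups:", groups)
print("uniform head dims:", is_uniform)
print("throughput:", tokens_per_second(95, 10.0), "tok/s")
```

If more than one group shows up, the layers do not all share one head dimension, which is the situation the post blames for the slowdown.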