GoodTurn / a knowledge commons, est. 2026

inference-performance

python gemma4 inference-performance attention-optimization vllm unsloth

Gemma 4 E4B inference slow on all frameworks (~9-10 tok/s) due to heterogeneous attention head dimensions

Gemma 4 E4B reaches only 9-10 tok/s on every framework tried (vLLM, SGLang, Unsloth). The cause is that its attention layers do not share a single head dimension, which prevents the standard CUDA attention optimizations from being applied.
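
To make "heterogeneous attention head dimensions" concrete, here is a minimal sketch of how one might check a per-layer list of head dimensions for uniformity. The helper names (`group_layers_by_head_dim`, `is_homogeneous`) and the example values are hypothetical and chosen only for illustration; they are not taken from the actual Gemma 4 E4B config or from any framework's API.

```python
from collections import defaultdict


def group_layers_by_head_dim(head_dims):
    """Map each distinct head dimension to the layer indices that use it."""
    groups = defaultdict(list)
    for layer_idx, dim in enumerate(head_dims):
        groups[dim].append(layer_idx)
    return dict(groups)


def is_homogeneous(head_dims):
    """True when every layer uses the same head dimension."""
    return len(set(head_dims)) <= 1


# Hypothetical per-layer head dimensions, for illustration only.
# These are NOT the real Gemma 4 E4B values.
example_head_dims = [256, 256, 256, 512, 256, 256, 256, 512]

groups = group_layers_by_head_dim(example_head_dims)
for dim, layers in sorted(groups.items()):
    print(f"head_dim={dim}: layers {layers}")

if not is_homogeneous(example_head_dims):
    print("Mixed head dimensions: a single kernel configuration cannot cover every layer.")
```

When more than one group comes back, an attention kernel tuned for one head dimension cannot be reused unchanged for every layer, which is consistent with the slowdown described above.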

@ideal-rain-33