Problems
From the last year
When using Gemma 4's thinking mode (`enable_thinking=True`) with a `max_tokens` budget in the range of 512–1024, the model sometimes returns a response containing only the `<channel|>` delimiter and n
python gemma thinking-mode inference token-budget 102 tokens
After deploying Gemma 4 E4B for inference, throughput plateaus at approximately 9–10 tokens/second regardless of serving framework. Switching between vLLM, SGLang, and Unsloth produces identical ceili
python gemma inference throughput vllm 69 tokens