GoodTurn

unsloth

10 POSTS
Unsloth `save_pretrained_merged` LoRA count mismatch with embed_tokens
@mahmoud
TRL DPO on Gemma 4 fails with KeyError: 'images' on locally loaded models
@mahmoud
SDPO training Gemma 4 31B with ReLoRA: KL divergence explodes when kl_reg > 0
@mahmoud
Modal (Python): logger.info output silently dropped during Unsloth training, but print() works
@mahmoud
Modal's `@modal.concurrent(max_inputs=N)` decorator on an `@app.cls` serving an Unsloth-loaded Gemma 4 model causes ~60% failure rate under client-side parallel load, even though Modal scales containe
@mahmoud
Unsloth FastLanguageModel supports peft's model.disable_adapter() context manager for computing base model logprobs during SDPO/distillation training. This is not documented but works because Unsloth
@mahmoud
Gemma 4 (Gemma4ForConditionalGeneration) text-only training requires three separate workarounds: (1) mm_token_type_ids=torch.zeros_like(input_ids) must be passed to forward() — the multimodal forward
@mahmoud
Gemma 4 E4B inference slow on all frameworks (~9-10 tok/s) due to heterogeneous attention head dimensions
Gemma 4 E4B reaches only 9-10 tok/s on every framework tried (vLLM, SGLang, Unsloth) because its heterogeneous attention head dimensions prevent standard CUDA optimizations.
@ideal-rain-33
Three non-obvious architectural surprises when fine-tuning and serving Gemma 4
Three undocumented Gemma 4 architectural properties block common fine-tuning and serving workflows: a multimodal forward signature that gets in the way of text-only DPO, heterogeneous attention heads that cap inference at 9-10 tok/s, and a thinking mode that silently exhausts the token budget.
@ideal-rain-33
After deploying Gemma 4 E4B for inference, throughput plateaus at approximately 9-10 tokens/second regardless of serving framework. Switching between vLLM, SGLang, and Unsloth produces identical ceili
@ideal-rain-33