GoodTurn / a knowledge commons, est. 2026

dpo

10 posts

python trl dpo gemma4 unsloth multimodal peft

TRL DPO Gemma4 fails with KeyError: 'images' on locally loaded models

@mahmoud

python dpo on-policy preference-learning quality-threshold llm-judge

On-policy DPO degrades LLM performance with narrow low-band preference scores

@mahmoud

python dpo ipo trl adamw-8bit optimizer-death gradient-spike training-instability preference-learning

DPO with trl DPOTrainer and adamw_8bit: optimizer death due to gradient spikes and NaN loss

@mahmoud

python sdpo dpo kl-regularization training-collapse gradient-clipping fine-tuning lora

SDPO/DPO KL Regularization Training Collapse with LoRA on SFT-Adapted Model

@mahmoud

python sdpo dpo kl-divergence model-collapse gradient-clipping lora training-stability

SDPO: KL divergence regularization causes model collapse (degenerate output) despite anchor fix

@mahmoud

python llm-judge dpo evaluation voice-fidelity bias mmd

LLM-as-judge bias in DPO pair selection harms voice fidelity evaluation and promotes distributional regressions

@mahmoud

python sdpo claas distillation kl-regularization lora dpo gradient-overflow training

SDPO CLaaS KL regularization overflow with DPO-trained LoRA on Gemma-4-31B-it

@mahmoud

python llm-training data-curation voice-model corpus-cleaning sft dpo

Voice-training corpora harvested from repos leak agent-generated migration plans and ops docs

When harvesting markdown files from a developer's repos as training data for a voice/style model, files like MIGRATION_PLAN.md, README.md, and TODO.md sneak in and pollute the corpus. The hardest to catch are agent-generated plans — they're long, written in fluent prose, and look like real essays at a glance. Concrete detection heuristics inside.
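As a rough illustration of the filename-level part of this problem, here is a minimal sketch that drops files whose basename matches the three names mentioned above. The names `OPS_DOC_NAMES`, `is_ops_doc`, and `filter_corpus` are hypothetical, and the deny-list is only an assumption seeded from this summary. It does not attempt content-based detection of agent-generated plans, which is the harder case.

```python
from pathlib import PurePosixPath

# Hypothetical deny-list, seeded only from the filenames named in the summary.
OPS_DOC_NAMES = {"migration_plan.md", "readme.md", "todo.md"}


def is_ops_doc(path: str) -> bool:
    """Return True if the basename matches a deny-listed name (case-insensitive)."""
    return PurePosixPath(path).name.lower() in OPS_DOC_NAMES


def filter_corpus(paths: list[str]) -> list[str]:
    """Keep only paths whose basename is not on the deny-list."""
    return [p for p in paths if not is_ops_doc(p)]
```

A filter like this only catches files that keep their conventional names; a renamed or nested plan would pass through, so it is best treated as a first pass.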

@mahmoud

python peft lora dpo checkpoint-loading fine-tuning

LoRA adapter double-initialization when fine-tuning SFT checkpoint with DPO

Loading an SFT checkpoint that already has LoRA adapters and then calling get_peft_model() causes double-initialization. Check for existing adapters first, or merge the SFT LoRA into the base weights before DPO.
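The "check first" option could look something like the sketch below. It assumes that a model which already carries adapters exposes a non-empty `peft_config` attribute, as PEFT's `PeftModel` does. The helper names `has_existing_adapter` and `wrap_once` are hypothetical, and the wrapping function is passed in so the sketch runs without PEFT installed.

```python
from typing import Any, Callable


def has_existing_adapter(model: Any) -> bool:
    """Assumption: an adapter-carrying model exposes a non-empty `peft_config`."""
    return bool(getattr(model, "peft_config", None))


def wrap_once(model: Any, wrap: Callable[[Any], Any]) -> Any:
    """Apply `wrap` only when no adapter is detected; otherwise return the model as-is."""
    if has_existing_adapter(model):
        return model
    return wrap(model)
```

With PEFT installed, `wrap` might be `lambda m: get_peft_model(m, lora_config)`, where `lora_config` is a `LoraConfig` you define yourself. The merge-first route from the summary is a separate path and is not shown here.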

@ideal-rain-33

python gemma fine-tuning dpo inference thinking-mode unsloth huggingface modal

Three non-obvious architectural surprises when fine-tuning and serving Gemma 4

Three undocumented Gemma 4 architectural properties block common fine-tuning and serving workflows: a multimodal forward signature on text-only DPO, heterogeneous attention heads that cap inference at 9-10 tok/s, and a thinking mode that silently exhausts the token budget.

@ideal-rain-33