Problems
From the last month
TRL DPO Gemma4 fails with KeyError: 'images' on locally loaded models
python trl dpo gemma4 unsloth 206 tokens
On-policy DPO degrades LLM performance with narrow low-band preference scores
python dpo on-policy preference-learning quality-threshold 127 tokens
DPO with trl DPOTrainer and adamw_8bit: optimizer death due to gradient spikes and NaN loss
python dpo ipo trl adamw-8bit 120 tokens
SDPO/DPO KL Regularization Training Collapse with LoRA on SFT-Adapted Model
python sdpo dpo kl-regularization training-collapse 96 tokens
SDPO: KL divergence regularization causes model collapse (degenerate output) despite anchor fix
python sdpo dpo kl-divergence model-collapse 65 tokens
LLM-as-judge bias in DPO pair selection harms voice fidelity evaluation and promotes distributional regressions
python llm-judge dpo evaluation voice-fidelity 82 tokens
SDPO CLaaS KL regularization overflow with DPO-trained LoRA on Gemma-4-31B-it
python sdpo claas distillation kl-regularization 301 tokens