SDPO/DPO KL Regularization Training Collapse with LoRA on an SFT-Adapted Model

Root cause: Gradient clipping prevents magnitude explosion but NOT direction corruption. When the KL regularizer contributes 2,200 to the loss versus 0.4 from distillation (a ratio of about 5,500x), the gradient direction points almost entirely toward 'undo drift from base' rather than 'learn from corpus.' The k3 KL estimator, exp(log_ratio) - 1 - log_ratio, grows exponentially in positive log_ratio, so moderate drift is amplified sharply: log_ratio=10 gives k3 ≈ 22,015 per token.
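
To make both halves of the root cause concrete, here is a minimal sketch in plain Python. It evaluates the k3 formula quoted above and then applies norm clipping to a toy two-dimensional gradient. The helper names (k3, clip_by_norm, cosine) and the toy gradient vectors are assumptions made for illustration; they are not taken from the training code, and the 0.4 and 2,200 magnitudes are borrowed from the loss values above only as stand-ins for gradient sizes.

```python
import math


def k3(log_ratio: float) -> float:
    """Per-token k3 KL estimate: exp(r) - 1 - r. It is zero at r = 0."""
    # expm1(r) computes exp(r) - 1 accurately for small r.
    return math.expm1(log_ratio) - log_ratio


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm. Direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# k3 grows exponentially once log_ratio is positive and moderately large.
for r in (1.0, 5.0, 10.0):
    print(f"log_ratio={r:>4}: k3={k3(r):,.2f}")

# Hypothetical toy gradients: the distillation signal along axis 0,
# the KL regularizer along axis 1, with very different magnitudes.
distill_grad = [0.4, 0.0]
kl_grad = [0.0, 2200.0]
total_grad = [d + k for d, k in zip(distill_grad, kl_grad)]

clipped = clip_by_norm(total_grad, max_norm=1.0)
print("alignment with KL gradient:     ", cosine(clipped, kl_grad))
print("alignment with distill gradient:", cosine(clipped, distill_grad))
```

In this toy setup the clipped gradient has norm 1.0 but is still almost perfectly aligned with the KL term, which is the sense in which clipping bounds magnitude without repairing direction.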

Fix: Set kl_reg_weight=0.0. The distillation loss IS the drift constraint: it keeps the model close to the teacher distribution. A second constraint (stay close to base) actively conflicts with the primary learning signal when the adapter has intentionally specialized away from base.
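
As a rough sketch of what the fix amounts to, assuming the total loss is a weighted sum of the two terms (the actual training code may combine them differently), the function below skips the KL term outright when kl_reg_weight is 0.0. The function name total_loss and its arguments are hypothetical; only kl_reg_weight comes from the note above.

```python
def total_loss(distill_loss: float, kl_term: float, kl_reg_weight: float = 0.0) -> float:
    """Combine distillation loss with an optional KL regularizer.

    Assumption: the loss is distill_loss + kl_reg_weight * kl_term.
    """
    if kl_reg_weight == 0.0:
        # Skip the KL term entirely instead of multiplying by zero:
        # in floating point, 0.0 * inf is nan, so an overflowed KL value
        # would otherwise still poison the loss.
        return distill_loss
    return distill_loss + kl_reg_weight * kl_term


print(total_loss(0.4, 2200.0, kl_reg_weight=1.0))  # KL term dominates
print(total_loss(0.4, 2200.0))                     # distillation only
print(total_loss(0.4, float("inf")))               # still finite
```

Short-circuiting on a zero weight is a design choice worth considering here, since it also avoids computing gradients through a term that is known to be numerically fragile.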

Defensive measure: Tighten the log_ratio clamp from (-20, 20) to (-5, 5). This lowers the maximum per-token KL from about 485M (exp(20) - 21) to about 142 (exp(5) - 6), preventing numerical instability if KL reg is re-enabled in future experiments.
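
The effect of the clamp bounds can be checked with a few lines of plain Python. This is a sketch that assumes the clamp is applied to log_ratio before the k3 formula is evaluated; the helper name k3_clamped is hypothetical.

```python
import math


def k3_clamped(log_ratio: float, lo: float = -5.0, hi: float = 5.0) -> float:
    """k3 estimate with log_ratio clamped to [lo, hi] before exponentiation."""
    r = max(lo, min(hi, log_ratio))
    return math.expm1(r) - r


# Worst-case per-token value under each clamp (reached at the upper bound).
for bound in (20.0, 5.0):
    worst = k3_clamped(1e9, lo=-bound, hi=bound)
    print(f"clamp=(-{bound:g}, {bound:g}): max k3 = {worst:,.2f}")

# The negative side is far milder: k3 grows only linearly as log_ratio -> -inf.
print(f"k3 at lower bound -5: {k3_clamped(-1e9):.4f}")
```

The maximum is always reached at the upper bound, because k3 grows exponentially for positive log_ratio and only linearly for negative log_ratio.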

Verification: After the fix, distillation loss decreased smoothly (0.125 -> 0.059), grad norms stayed in the 0.13-0.19 range, and benchmark scores matched baseline (combined=0.565, judge=0.70).