SDPO/DPO training with KL regularization to the base model collapses into degenerate repetition when the student model has already drifted significantly from base (e.g., after SFT voice adaptation). Symptoms: distill loss stagnates or increases (0.40->0.46), pre-clip grad_norm explodes (6,477->27,422), and outputs are repetitive garbage. This happens DESPITE gradient clipping being active (clip_grad_norm_ with max_norm=1.0).
Root cause: Gradient clipping prevents magnitude explosion but NOT direction corruption. Clipping rescales the whole gradient by one factor, so the relative weight of each loss term is unchanged. When KL reg contributes 2,200 to the loss vs 0.4 from distillation (a ratio of roughly 5,500x), the gradient direction points almost entirely toward 'undo drift from base' rather than 'learn from corpus.' The k3 KL estimator (exp(log_ratio) - 1 - log_ratio) grows exponentially in the log ratio, so moderate drift is amplified: log_ratio=10 gives k3 of about 22,015 per token.
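A minimal sketch of both effects in plain Python, to make the arithmetic concrete. The 2-D gradient vectors are illustrative assumptions (the note reports loss magnitudes, not per-term gradients), and clip_by_norm is a hypothetical stand-in that mimics the rescaling idea behind clip_grad_norm_, not the training code itself.

```python
import math

def k3(log_ratio: float) -> float:
    """k3 KL estimator from the note: exp(r) - 1 - r for a per-token log ratio r."""
    return math.exp(log_ratio) - 1.0 - log_ratio

def norm(v):
    """L2 norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in v))

def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad so its L2 norm is at most max_norm (hypothetical helper)."""
    scale = min(1.0, max_norm / (norm(grad) + 1e-6))
    return [g * scale for g in grad]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Toy gradients: ASSUMED to share the 0.4 vs 2,200 imbalance of the loss terms.
distill_grad = [0.4, 0.0]   # direction "learn from corpus"
kl_grad = [0.0, 2200.0]     # direction "undo drift from base"
total = [d + k for d, k in zip(distill_grad, kl_grad)]
clipped = clip_by_norm(total, max_norm=1.0)

print(f"k3(10)                     = {k3(10.0):,.2f}")
print(f"norm after clipping        = {norm(clipped):.6f}")
print(f"cos(clipped, kl_grad)      = {cosine(clipped, kl_grad):.8f}")
print(f"cos(clipped, distill_grad) = {cosine(clipped, distill_grad):.8f}")
```

Under these assumed values the clipped gradient has norm close to 1.0 but stays almost perfectly aligned with the KL direction, which is the sense in which clipping fixes magnitude and not direction.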
Fix: Set kl_reg_weight=0.0. Distillation loss IS the drift constraint: it keeps the model close to the teacher distribution. A second constraint (stay close to base) actively conflicts with the primary learning signal when the adapter has intentionally specialized away from base.
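One way the fix might look where the loss terms are combined, as a sketch only. The function name total_loss and its arguments are hypothetical; only kl_reg_weight comes from the note. The sketch skips the KL term outright when the weight is zero rather than multiplying by 0.0, since in floating point 0.0 * inf is NaN and a non-finite KL value could otherwise still poison the loss.

```python
import math

def total_loss(distill_loss: float, kl_to_base: float, kl_reg_weight: float = 0.0) -> float:
    """Combine distillation loss with an optional KL-to-base penalty (hypothetical helper)."""
    if kl_reg_weight == 0.0:
        # KL reg disabled: the distillation loss alone is the drift constraint.
        return distill_loss
    return distill_loss + kl_reg_weight * kl_to_base

# With the weight at 0.0 the KL magnitude no longer matters, even if it is non-finite.
print(total_loss(0.4, 2200.0))        # KL reg disabled
print(total_loss(0.4, math.inf))      # still finite
print(total_loss(0.4, 2200.0, 1.0))   # KL reg enabled: dominated by the KL term
```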
Defensive measure: Tighten the log_ratio clamp from (-20, 20) to (-5, 5). This caps the max per-token k3 value at about 142 (exp(5) - 6) instead of about 485M (exp(20) - 21), preventing numerical instability if KL reg is re-enabled in future experiments.
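The two caps can be checked with a short sketch. k3_clamped is a hypothetical helper that applies the clamp before the k3 formula quoted in the note; the real training code presumably does the same on tensors.

```python
import math

def k3_clamped(log_ratio: float, lo: float = -5.0, hi: float = 5.0) -> float:
    """Clamp the log ratio to [lo, hi], then apply k3 = exp(r) - 1 - r."""
    r = max(lo, min(hi, log_ratio))
    return math.exp(r) - 1.0 - r

for bound in (20.0, 5.0):
    # The worst case sits at the positive bound, where exp(r) dominates.
    at_hi = k3_clamped(bound, -bound, bound)
    at_lo = k3_clamped(-bound, -bound, bound)
    print(f"clamp=(-{bound:g}, {bound:g}): k3 at +bound = {at_hi:,.2f}, at -bound = {at_lo:,.2f}")

# Any drift beyond the clamp is treated the same as drift at the bound.
print(k3_clamped(100.0) == k3_clamped(5.0))
```

The negative bound is much less dangerous than the positive one: at log_ratio=-5 the estimator is only about 4, because exp(r) vanishes and the linear term dominates.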
Verification: After fix, distill loss decreased smoothly (0.125->0.059), grad norms stayed at 0.13-0.19, and benchmark scores matched baseline (combined=0.565, judge=0.70).