ReLoRA (iterative LoRA merge-and-reinit) applied to SDPO distillation training with Gemma 4 31B. With kl_reg_weight=0.0, the distillation loss converges smoothly across 4 ReLoRA generations (gen1 step1: 0.16 -> gen4 step4: 0.06). With kl_reg_weight > 0 on a LoRA-on-LoRA setup, KL values explode to 1e6-1e8, because the KL reference policy is the original base model rather than the LoRA checkpoint the SDPO adapter is stacked on. The fix is either kl_reg_weight=0.0 or anchoring the KL term to a snapshot of the LoRA policy taken at SDPO start.
Set kl_reg_weight=0.0 for LoRA-on-LoRA SDPO training. The KL regularization term measures divergence from the base model, but when the student is a LoRA adapter stacked on an SFT LoRA adapter, the student already starts far from the base model's distribution. The resulting KL values, in the 1e6-1e8 range, dominate the loss and prevent learning. Alternative: implement a start-snapshot KL anchor that captures the student's logits at SDPO initialization and uses them as the reference distribution.
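The start-snapshot anchor can be sketched roughly as follows. This is a toy, pure-Python illustration and not the actual SDPO trainer code: the class StartSnapshotKLAnchor, the function names, and the logit values are hypothetical, and the KL direction (student relative to reference) is an assumption. A real implementation would more likely keep a frozen copy of the policy and recompute reference logits per batch; here the snapshot is just a stored list of logits for one fixed batch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(student_logits, reference_logits):
    """KL(student || reference) at one token position (direction is an assumption)."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(reference_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


class StartSnapshotKLAnchor:
    """Hypothetical helper: freezes the student's logits at SDPO start."""

    def __init__(self, student_logits_at_start):
        # Copy so later updates to the student cannot mutate the reference.
        self.reference = [list(row) for row in student_logits_at_start]

    def penalty(self, student_logits, kl_reg_weight):
        # Mean per-token KL against the snapshot, scaled by the KL weight.
        per_token = [
            kl_divergence(s, r) for s, r in zip(student_logits, self.reference)
        ]
        return kl_reg_weight * sum(per_token) / len(per_token)


# Toy logits for a single token position over a 3-token vocabulary.
base_logits = [[0.0, 0.0, 0.0]]      # stand-in for the original base model
student_start = [[8.0, 0.0, -8.0]]   # stand-in for the SFT-LoRA policy at SDPO start
student_drifted = [[7.0, 1.0, -8.0]] # stand-in for the student after some updates

anchor = StartSnapshotKLAnchor(student_start)

# Against the base model the student is already penalized before any training.
kl_vs_base_at_start = kl_divergence(student_start[0], base_logits[0])

# Against the snapshot the penalty starts at zero and grows only with drift.
penalty_at_start = anchor.penalty(student_start, kl_reg_weight=0.1)
penalty_after_drift = anchor.penalty(student_drifted, kl_reg_weight=0.1)
```

The toy numbers do not reproduce the 1e6-1e8 magnitudes reported above. They only show the structural difference: a base-model reference yields a nonzero KL at step zero, while the snapshot reference yields zero at step zero and penalizes only movement away from the SDPO starting policy.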