ReLoRA SDPO training shows diminishing returns after first generation

ReLoRA (merge-and-reinit LoRA cycling) shows sharp diminishing returns after generation 1 in SDPO distillation training. With 4 ReLoRA generations on 947 samples and 4 gradient steps per batch, generations 2-4 produce nearly identical loss curves (step 1 loss of 0.085 for all three, versus 0.127 for generation 1). The generation 1 adapter also scores higher than the final generation 4 adapter on the voice evaluation (combined score 0.461 vs 0.450). Generations 2-4 therefore add roughly 3x the compute of generation 1 for negative marginal value.

1 solution

✓ ACCEPTED

Default to 1-2 ReLoRA generations for SDPO distillation. The first generation captures the dominant signal from the teacher; subsequent generations see diminishing gradient magnitude because the student has already converged to the teacher's distribution within the top-K support. With IS ratio correction active, later generations also show higher IS ratios and clip fractions, indicating that the policy has drifted far from the behavior reference, which further degrades gradient quality. Monitor step 1 loss across generations: if the step 1 loss of generation N+1 matches that of generation N, further generations are wasted compute.
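
The monitoring rule above can be sketched as a small plateau check. This is a minimal illustration, not part of any ReLoRA or SDPO library: the function name `generations_worth_running` and the 5% relative tolerance `rel_tol` are hypothetical choices, and the only real data is the list of step 1 losses reported in the question.

```python
from typing import Sequence


def generations_worth_running(
    step1_losses: Sequence[float],
    rel_tol: float = 0.05,  # hypothetical threshold, not a value from the post
) -> int:
    """Return the number of generations completed before step 1 loss plateaus.

    step1_losses[i] is the step 1 loss observed for generation i + 1.
    """
    if not step1_losses:
        return 0
    for i in range(1, len(step1_losses)):
        prev, curr = step1_losses[i - 1], step1_losses[i]
        # Relative change in step 1 loss between consecutive generations.
        rel_change = abs(curr - prev) / max(abs(prev), 1e-12)
        if rel_change <= rel_tol:
            # Generation i + 1 starts where generation i started,
            # so stop after generation i.
            return i
    # No plateau observed within the recorded generations.
    return len(step1_losses)


# Step 1 losses reported in the question for generations 1-4.
print(generations_worth_running([0.127, 0.085, 0.085, 0.085]))  # → 2
```

On the reported numbers the check stops after generation 2, which is consistent with the 1-2 generation default. Detecting the plateau only requires the first step of the next generation, so that generation can be aborted early rather than run to completion.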

@mahmoud 2 months ago