Python SDPO voice cloning: Hindsight teacher loss causes regression to base model distribution

SDPO (Self-Distillation Policy Optimization) with a GJS loss and teacher=base+hindsight pushes the fine-tuned model back toward the base model's distribution instead of the target author's voice. Symptoms: em-dash overuse returns (a base-model habit), outputs lose specificity, and opening structures collapse to a few modes. The self-distillation signal reinforces base-model priors rather than corpus-derived voice patterns.
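
To make the failure mode concrete, here is a minimal sketch of the loss on a toy next-token distribution. It assumes "GJS" means the generalized Jensen-Shannon divergence with a mixing weight beta; the post does not define it, so the function name `gjs`, the `beta` parameter, and the three-token distributions are all illustrative assumptions, not the actual SDPO implementation.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gjs(student, teacher, beta=0.5):
    """Generalized Jensen-Shannon divergence (assumed form).

    m   = beta * teacher + (1 - beta) * student
    gjs = beta * KL(teacher || m) + (1 - beta) * KL(student || m)
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    m = [beta * t + (1 - beta) * s for s, t in zip(student, teacher)]
    return beta * kl(teacher, m) + (1 - beta) * kl(student, m)

# Toy distributions over three punctuation choices: [em_dash, comma, period].
# The numbers are made up for illustration.
base_like_teacher = [0.6, 0.3, 0.1]   # teacher stays close to base habits
author_voice = [0.1, 0.5, 0.4]        # what the corpus actually looks like

# A student halfway between the author's voice and the teacher.
halfway = [(a + t) / 2 for a, t in zip(author_voice, base_like_teacher)]

print("loss at author voice:", gjs(author_voice, base_like_teacher))
print("loss halfway to base:", gjs(halfway, base_like_teacher))
print("loss at teacher:     ", gjs(base_like_teacher, base_like_teacher))
```

Under these assumptions the loss falls as the student moves from the author's distribution toward the teacher and reaches zero only when the two match. If the teacher is essentially the base model plus hindsight context, minimizing this loss rewards base-model habits.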

1 solution


✓ ACCEPTED

The root cause is that GJS interpolation between the student and the base+hindsight teacher creates a loss landscape whose attractor is the base distribution rather than the author's voice. Three mitigations:

(1) Add distributional metrics (MMD over sentence embeddings, corpus similarity) as judge-independent scoring dimensions. These catch batch-level regressions that per-sample LLM evaluation misses. Rebalance the combined score: cut the LLM judge weight from 0.60 to 0.40 and assign the freed 0.20 to the distributional metrics.

(2) Build contrastive DPO pairs in which the author's actual writing is 'chosen' and output from a generic external model is 'rejected'. Use a different model family for the rejected side to avoid contamination. This explicitly teaches voice distinctiveness.

(3) Use ReLoRA (merge-and-reinit) for multi-generation training. Each generation merges the LoRA into the base via Unsloth's save_pretrained_merged, which prevents the student from drifting back to the original base distribution.
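
Mitigations (1) and (2) can be sketched roughly as follows. This is a dependency-free illustration, not the code used in the answer. It assumes sentence embeddings are already computed as lists of floats (any embedding model would do), uses an RBF kernel with a hand-picked `gamma`, and maps MMD to a 0-1 score with a simple linear rule. The helper names, the `other` score slot for the remaining 0.40 of the weight, and the prompt/chosen/rejected record layout are all assumptions made for the example.

```python
import math

def rbf(x, y, gamma):
    """RBF kernel between two embedding vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_kernel(xs, ys, gamma):
    """Average kernel value over all cross pairs of two batches."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two batches of embeddings."""
    return (mean_kernel(xs, xs, gamma) + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))

def distributional_score(mmd2_value):
    """Map squared MMD to [0, 1]. With an RBF kernel the biased estimate
    lies in [0, 2], so 1 - mmd2 / 2 is one simple (assumed) choice."""
    return max(0.0, 1.0 - mmd2_value / 2.0)

def combined_score(judge, distributional, other):
    """Rebalanced score: judge 0.40 (down from 0.60), distributional 0.20.
    'other' is a hypothetical stand-in for whatever dimensions hold the
    remaining 0.40; the post does not list them."""
    return 0.40 * judge + 0.20 * distributional + 0.40 * other

def build_preference_pairs(prompts, author_texts, generic_texts):
    """Pair the author's real writing (chosen) with generic-model output
    (rejected) for the same prompt."""
    if not (len(prompts) == len(author_texts) == len(generic_texts)):
        raise ValueError("inputs must have the same length")
    return [
        {"prompt": p, "chosen": a, "rejected": g}
        for p, a, g in zip(prompts, author_texts, generic_texts)
    ]

# Toy 2-d "embeddings", made up for illustration.
author_batch = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
close_batch = [[0.05, 0.0], [0.0, 0.05], [0.1, 0.1]]
drifted_batch = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]]

print("close:  ", distributional_score(mmd2(author_batch, close_batch)))
print("drifted:", distributional_score(mmd2(author_batch, drifted_batch)))

pairs = build_preference_pairs(
    ["Write an opening line."],
    ["<author's actual sentence>"],
    ["<generic external model output>"],
)
print(pairs[0])
```

The distributional score is computed per batch rather than per sample, so a batch whose individual outputs each pass the judge can still score poorly if the batch as a whole has moved away from the author's corpus. The kernel bandwidth and the mapping from MMD to a score would need tuning on real embeddings.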

@mahmoud 2 months ago