On-policy DPO degrades LLM performance with narrow low-band preference scores

On-policy DPO (generating N candidates from the same model and using an LLM judge to select the chosen/rejected pair for each prompt) degrades the model when the judge scores every candidate in a narrow, low band (2.3-3.1 out of 5). With 17 pairs averaging chosen=3.0 and rejected=2.3 (gap=0.7), 3 epochs of DPO training lowered the combined benchmark score by 0.005 relative to the pre-DPO adapter. The chosen signal teaches mediocrity: a 3.0/5 output is not good, only the least bad of the candidates.
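For readers who want to check their own pair set for the same symptom, here is a minimal sketch of the pair construction described above, plus the summary statistics quoted in this post (mean chosen, mean rejected, mean gap). The function names, the `(text, judge_score)` tuple layout, and the sample scores are illustrative assumptions, not the original pipeline or data.

```python
from statistics import mean

def build_pair(candidates):
    """Pick the highest- and lowest-scored candidate for one prompt.

    `candidates` is a list of (text, judge_score) tuples. This layout is
    an assumption made for the sketch.
    """
    ranked = sorted(candidates, key=lambda c: c[1])
    return {"chosen": ranked[-1], "rejected": ranked[0]}

def summarize(pairs):
    """Mean judge score of chosen and rejected outputs, and their gap."""
    chosen = mean(p["chosen"][1] for p in pairs)
    rejected = mean(p["rejected"][1] for p in pairs)
    return {
        "mean_chosen": chosen,
        "mean_rejected": rejected,
        "mean_gap": chosen - rejected,
    }

# Hypothetical judge scores for two prompts, all in a narrow low band.
pairs = [
    build_pair([("a", 2.3), ("b", 3.0), ("c", 2.7)]),
    build_pair([("d", 2.4), ("e", 3.1), ("f", 2.2)]),
]
summary = summarize(pairs)
print(summary)
```

A low `mean_chosen` together with a small `mean_gap` is the pattern reported here: the pairs exist, but the "chosen" side is not an example of good output.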

1 solution


✓ ACCEPTED

On-policy DPO only works if the model already produces meaningful quality variance. If the best candidate scores below ~3.5/5 (or whatever threshold represents genuinely good output on your rubric), the preference signal is noise between two mediocre outputs. Prerequisites: (1) the model must sometimes produce good output (judge score > 3.5); (2) it must also sometimes produce bad output, so there is variance to learn from; (3) the gap between chosen and rejected should reflect a real quality difference, not just measurement noise. If all candidates cluster in a narrow band, raise the sampling temperature to get more variance, or improve the base model's quality before attempting DPO refinement.
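One way to enforce these prerequisites is to gate each prompt before it contributes a pair. The sketch below is only an illustration: `select_dpo_pair` is a hypothetical helper, `min_chosen=3.5` mirrors the threshold suggested above, and `min_gap=1.0` is an assumed value that you would replace with something larger than your judge's measured noise.

```python
def select_dpo_pair(scores, min_chosen=3.5, min_gap=1.0):
    """Return (chosen_index, rejected_index), or None if the prompt
    should be skipped.

    `scores` holds the judge scores for the N candidates of one prompt.
    Both thresholds are assumptions to tune for your own rubric.
    """
    if len(scores) < 2:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    # Prerequisite 1: the chosen output must be genuinely good.
    if scores[best] < min_chosen:
        return None
    # Prerequisites 2 and 3: the spread must exceed judge noise.
    if scores[best] - scores[worst] < min_gap:
        return None
    return best, worst

# Narrow low band, as in the failure case: skipped.
print(select_dpo_pair([2.3, 3.0, 2.7]))
# One good and one bad candidate: kept.
print(select_dpo_pair([2.0, 4.2, 3.1]))
```

If most prompts are skipped by a gate like this, that is itself the signal to raise temperature or improve the base model first, rather than train on the few pairs that remain.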

@mahmoud about 2 months ago