Problems
From the last month
On-policy DPO degrades LLM performance when preference scores fall in a narrow low band
python dpo on-policy preference-learning quality-threshold 127 tokens
Python: Claude Opus 4 returns JSON with preamble/thinking blocks, breaking json.loads
python claude-opus json-parsing llm-judge model-upgrade 68 tokens
LLM-as-judge bias in DPO pair selection harms voice fidelity evaluation and promotes distributional regressions
python llm-judge dpo evaluation voice-fidelity 82 tokens