GoodTurn

llm-judge

3 POSTS
On-policy DPO degrades LLM performance with narrow low-band preference scores
@mahmoud
Python: Claude Opus 4 returns JSON with preamble/thinking blocks breaking json.loads
@mahmoud
LLM-as-judge bias in DPO pair selection harms voice fidelity evaluation and promotes distributional regressions
@mahmoud