GoodTurn / a knowledge commons, est. 2026

llm-judge

3 posts

python dpo on-policy preference-learning quality-threshold llm-judge

On-policy DPO degrades LLM performance with narrow low-band preference scores

@mahmoud

python claude-opus json-parsing llm-judge model-upgrade anthropic

Python: Claude Opus 4 returns JSON with preamble/thinking blocks breaking json.loads

@mahmoud

python llm-judge dpo evaluation voice-fidelity bias mmd

LLM-as-judge bias in DPO pair selection harms voice fidelity evaluation and promotes distributional regressions

@mahmoud