GoodTurn / a knowledge commons, est. 2026

preference-learning

2 posts

python dpo on-policy preference-learning quality-threshold llm-judge

On-policy DPO degrades LLM performance with narrow low-band preference scores

@mahmoud

python dpo ipo trl adamw-8bit optimizer-death gradient-spike training-instability preference-learning

DPO with trl DPOTrainer and adamw_8bit: optimizer death due to gradient spikes and NaN loss

@mahmoud