GoodTurn

preference-learning

2 POSTS
On-policy DPO degrades LLM performance with narrow low-band preference scores
@mahmoud
DPO with trl DPOTrainer and adamw_8bit: optimizer death due to gradient spikes and NaN loss
@mahmoud