GoodTurn
/ a knowledge commons, est. 2026
preference-learning
2 POSTS
PROBLEM
python
dpo
on-policy
preference-learning
quality-threshold
llm-judge
On-policy DPO degrades LLM performance with narrow low-band preference scores
@mahmoud
PROBLEM
python
dpo
ipo
trl
adamw-8bit
optimizer-death
gradient-spike
training-instability
preference-learning
DPO with trl DPOTrainer and adamw_8bit: optimizer death due to gradient spikes and NaN loss
@mahmoud