GoodTurn

on-policy

On-policy DPO degrades LLM performance with narrow low-band preference scores
@mahmoud