DPO training with trl DPOTrainer and the adamw_8bit optimizer can die silently after gradient spikes: the optimizer's momentum/variance buffers fill with NaN, grad_norm drops to 0 permanently, and training keeps burning GPU time with zero learning. The standard sigmoid DPO loss, -log(sigmoid(beta * delta)), saturates when the summed log-probability difference delta between chosen and rejected exceeds ~50 in magnitude (common with 1000+ token sequences where chosen and rejected texts are semantically very different): for large positive delta the loss collapses toward 0 and its gradient vanishes, and for large negative delta the loss grows without bound. Even with max_grad_norm=1.0 clipping, the 8-bit optimizer state corrupts on extreme pre-clip gradient magnitudes.
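The saturation described above can be checked numerically. The following is a minimal sketch in plain Python, not TRL's implementation; the function names are hypothetical and the logit stands for beta times the summed log-probability difference.

```python
import math

def sigmoid_dpo_loss(logit: float) -> float:
    """-log(sigmoid(logit)), computed stably as softplus(-logit)."""
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))

def sigmoid_dpo_grad(logit: float) -> float:
    """d(loss)/d(logit) = -sigmoid(-logit), written to avoid overflow."""
    if logit >= 0:
        e = math.exp(-logit)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(logit))

# Illustrative logits only: small values behave normally, +/-50 is saturated.
for logit in (-50.0, -5.0, 0.0, 5.0, 50.0):
    print(f"logit={logit:+6.1f}  loss={sigmoid_dpo_loss(logit):.3e}  "
          f"grad={sigmoid_dpo_grad(logit):.3e}")
```

At a logit of +50 both the loss and its gradient are numerically zero, so the pair contributes nothing; at -50 the loss is about 50 and the gradient on the logit is pinned near -1.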
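Switch to IPO loss (loss_type='ipo' in trl DPOConfig). IPO replaces the log-sigmoid with a squared loss that regresses the log-ratio margin toward a finite target of 1/(2*beta), so there is no sigmoid saturation. The squared loss itself is not bounded, but TRL's IPO implementation averages log-probabilities over completion tokens instead of summing them, which keeps the margin from growing with sequence length. In practice the optimizer survives gradient spikes that kill standard DPO.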
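To make the difference concrete, here is a small sketch of the IPO objective for a single pair, assuming the margin is the log-ratio gap between chosen and rejected. The function names and the numbers (1000 tokens, a per-token gap of 0.05) are hypothetical and chosen only for illustration; this is not TRL's code.

```python
def ipo_loss(margin: float, beta: float) -> float:
    """IPO objective for one pair: squared distance to the target 1/(2*beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2

def ipo_grad(margin: float, beta: float) -> float:
    """d(loss)/d(margin): linear in the margin, so it never saturates."""
    return 2.0 * (margin - 1.0 / (2.0 * beta))

beta = 0.1
n_tokens = 1000        # hypothetical completion length
per_token_gap = 0.05   # hypothetical mean log-ratio gap per token

summed_margin = per_token_gap * n_tokens   # margin if log-probs are summed
averaged_margin = per_token_gap            # margin if log-probs are averaged

for name, margin in (("summed", summed_margin), ("averaged", averaged_margin)):
    print(f"{name:>8}: margin={margin:7.2f}  "
          f"loss={ipo_loss(margin, beta):9.3f}  "
          f"grad={ipo_grad(margin, beta):+8.2f}")
```

The gradient stays finite and proportional to the distance from the target in both cases; with the averaged margin that distance no longer depends on how many tokens the completion has.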
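Also reduce beta significantly (0.01 instead of 0.1) when sequences are long (1000+ tokens). In sigmoid DPO, beta multiplies the summed log-probability difference, and that sum grows roughly in proportion to token count, so a beta tuned for short responses pushes long ones into saturation. Under IPO in TRL the log-probabilities are averaged per token, so there beta mainly sets the target margin 1/(2*beta).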
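The arithmetic behind the beta recommendation can be sketched as follows. The per-token gap of 0.5 is a hypothetical value, and the assumption that every token contributes the same gap is a simplification made only so the scaling is easy to see.

```python
import math

def dpo_logit(beta: float, per_token_gap: float, n_tokens: int) -> float:
    """beta * summed log-ratio gap, assuming a constant per-token gap."""
    return beta * per_token_gap * n_tokens

def gradient_weight(logit: float) -> float:
    """sigmoid(-logit): the fraction of gradient that survives the sigmoid.

    Plain exp is fine here because the logits in this sketch stay small.
    """
    return 1.0 / (1.0 + math.exp(logit))

per_token_gap = 0.5  # hypothetical value
for n_tokens in (100, 1000, 2000):
    for beta in (0.1, 0.01):
        z = dpo_logit(beta, per_token_gap, n_tokens)
        print(f"tokens={n_tokens:5d}  beta={beta:<5}  logit={z:6.1f}  "
              f"gradient weight={gradient_weight(z):.3e}")
```

Under these assumptions a 1000-token pair reaches a logit of 50 at beta=0.1, where the gradient weight is numerically zero, but only 5 at beta=0.01, where a usable fraction of the gradient remains.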
Critical diagnostic: if you see grad_norm drop from thousands to exactly 0 or to ~1e-15 mid-training and stay there, the optimizer is dead. Cancel immediately, because every subsequent step wastes GPU time with zero learning. The loss may still fluctuate (giving a false impression of activity), but no parameter updates are happening.
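This check is easy to automate. Below is a minimal sketch of a detector that flags a run once grad_norm has been effectively zero for several consecutive logged steps. The class name, the eps threshold, and the patience value are hypothetical, and treating NaN as dead is an added assumption rather than something stated above.

```python
import math

class DeadGradDetector:
    """Flags a run whose grad_norm stays at ~0 (or NaN) for `patience` steps."""

    def __init__(self, eps: float = 1e-12, patience: int = 3) -> None:
        self.eps = eps            # anything at or below this counts as zero
        self.patience = patience  # consecutive dead steps before flagging
        self.dead_steps = 0

    def update(self, grad_norm: float) -> bool:
        """Record one logged grad_norm; return True once the run looks dead."""
        is_dead = math.isnan(grad_norm) or abs(grad_norm) <= self.eps
        # A single healthy step resets the counter.
        self.dead_steps = self.dead_steps + 1 if is_dead else 0
        return self.dead_steps >= self.patience

# Hypothetical grad_norm history: healthy spikes, then collapse.
detector = DeadGradDetector(patience=3)
history = [1200.0, 3400.0, 0.0, 1e-15, 0.0, 0.0]
for step, grad_norm in enumerate(history):
    if detector.update(grad_norm):
        print(f"step {step}: grad_norm dead for "
              f"{detector.dead_steps} steps, stop training")
        break
```

With the transformers Trainer that DPOTrainer builds on, one way to wire this in would be a TrainerCallback whose on_log reads grad_norm from the logs dict and sets control.should_training_stop = True; whether grad_norm appears in the logs depends on the installed version, so treat that hookup as an assumption to verify.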
Note: even with IPO + low beta, DPO is ineffective when chosen and rejected texts are semantically unrelated (e.g., author wrote about topic A, model hallucinated about topic B). DPO needs paired variations of the same response to learn useful preferences. For corpus-vs-model DPO (corpus text as chosen, model output as rejected), use on-policy pairs instead, where both candidates come from the model's own distribution.