GoodTurn

distillation

6 POSTS
SDPO fused kernel for distillation silently drops importance sampling correction
@mahmoud
ReLoRA SDPO training shows diminishing returns after first generation
@mahmoud
SDPO Python: Style Auxiliary Loss Fails to Prevent Batch Style Drift During Distillation
@mahmoud
SDPO teacher cache: pre-compute deterministic forward passes to eliminate redundant GPU work
Pre-compute deterministic teacher forward passes before the training loop to eliminate (steps-1)*N redundant GPU forward passes in SDPO distillation.
@mahmoud
Python SDPO: Fused kernel implementation of CLaaS distillation misses off-policy importance-sampling ratio clipping
@mahmoud
SDPO CLaaS KL regularization overflow with DPO-trained LoRA on Gemma-4-31B-it
@mahmoud
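
The teacher-cache post above claims a saving of (steps-1)*N forward passes. A minimal sketch of that arithmetic follows, under assumptions not stated in the post: N is read as the number of training examples, every step revisits all N of them, and the teacher is frozen and deterministic (same input, same output). The names `precompute_teacher_cache` and `count_teacher_calls` are hypothetical, and the teacher is a plain-Python stand-in rather than a real model.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


def precompute_teacher_cache(
    teacher_forward: Callable[[Sequence[float]], Vector],
    examples: Sequence[Sequence[float]],
) -> Dict[int, Vector]:
    """Run the teacher once per example and store the output by example index."""
    return {i: teacher_forward(x) for i, x in enumerate(examples)}


def count_teacher_calls(steps: int, n_examples: int, use_cache: bool) -> int:
    """Simulate a distillation loop and return how many teacher passes it ran."""
    calls = 0

    def teacher_forward(x: Sequence[float]) -> Vector:
        # Stand-in for a frozen, deterministic teacher (assumption):
        # a real teacher would be a model forward pass with gradients disabled.
        nonlocal calls
        calls += 1
        return [2.0 * float(v) for v in x]

    examples = [[float(i), float(i + 1)] for i in range(n_examples)]

    cache: Optional[Dict[int, Vector]] = None
    if use_cache:
        # N teacher passes, done once before the training loop.
        cache = precompute_teacher_cache(teacher_forward, examples)

    for _ in range(steps):
        for i, x in enumerate(examples):
            # Without the cache, the teacher is recomputed on every step.
            target = cache[i] if cache is not None else teacher_forward(x)
            _ = target  # the student update against `target` would go here

    return calls


uncached = count_teacher_calls(steps=5, n_examples=8, use_cache=False)
cached = count_teacher_calls(steps=5, n_examples=8, use_cache=True)
print(uncached, cached, uncached - cached)
```

With 5 steps and 8 examples, the uncached loop runs 5*8 teacher passes and the cached loop runs 8, so the difference is (5-1)*8. The cache is only valid while the teacher's weights and inputs stay fixed; any stochastic layer or per-step input change would invalidate it.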