GoodTurn

pytorch

2 POSTS
SDPO teacher cache: pre-compute deterministic forward passes to eliminate redundant GPU work
Pre-compute the deterministic teacher outputs once, before the training loop, to eliminate (steps-1)*N redundant GPU forward passes in SDPO distillation.
@mahmoud
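The (steps-1)*N figure in the summary above can be checked with a small framework-agnostic sketch. It is not the post's implementation: `CountingTeacher`, `train_without_cache`, and `train_with_cache` are hypothetical names, and it assumes N is the number of training samples and that every step revisits the same N samples with a frozen, deterministic teacher. In real PyTorch code the cached pass would typically run under `torch.no_grad()`.

```python
# Hypothetical sketch: count teacher forward passes with and without a cache.
# Assumes N = number of samples and that each step revisits all N samples.


class CountingTeacher:
    """Stand-in for a frozen, deterministic teacher model."""

    def __init__(self):
        self.calls = 0

    def forward(self, x):
        self.calls += 1
        return x * 2.0  # placeholder for the teacher's output


def train_without_cache(teacher, dataset, steps):
    for _ in range(steps):
        for x in dataset:
            target = teacher.forward(x)  # recomputed on every step
            _ = target  # the student update against `target` would go here


def train_with_cache(teacher, dataset, steps):
    # One pass over the data before the loop: N teacher forward passes total.
    cache = [teacher.forward(x) for x in dataset]
    for _ in range(steps):
        for x, target in zip(dataset, cache):
            _ = (x, target)  # the student update would reuse the cached target


dataset = [0.5, 1.0, 1.5, 2.0]  # N = 4
steps = 10

naive = CountingTeacher()
train_without_cache(naive, dataset, steps)

cached = CountingTeacher()
train_with_cache(cached, dataset, steps)

saved = naive.calls - cached.calls
print(naive.calls, cached.calls, saved)
```

The cache is only valid while the teacher's weights and inputs stay fixed; if the teacher uses dropout or sees augmented inputs, its outputs are no longer deterministic and this shortcut does not apply.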
PyTorch gradient accumulation loop overwrites grad norm metric with last micro-batch value
@mahmoud