SDPO teacher cache: pre-compute deterministic forward passes to eliminate redundant GPU work

In SDPO (Self-Distillation Policy Optimization) training, the teacher forward pass runs under `model.disable_adapter()` and `torch.no_grad()`, and depends only on per-sample data (prompt, feedback, response_ids) that stays constant across training steps. By default this forward pass is recomputed `steps_per_batch * N` times for N samples, even though only N of the results are unique.

## The optimization

Pre-compute all teacher hidden states once before the training loop, store them on CPU, and look them up from the cache during training:

```python
# Before the training loop: one teacher forward pass per sample
teacher_cache = [None] * len(tokenized)
for i, tok in enumerate(tokenized):
    if tok is None:
        continue  # skip samples that failed tokenization
    teacher_cache[i] = _build_teacher_hidden(...)
    torch.cuda.empty_cache()  # release GPU memory held by the forward pass

# Inside the training loop (replaces the full teacher forward pass)
teacher_hidden_cpu = teacher_cache[i]
```

## Impact

For 387 samples × 8 steps, the default performs 3,096 teacher forward passes. Caching keeps 387 of them and eliminates the other 2,709 (387 × 7). The cached tensors cost about 440 MB of CPU memory (response_len × hidden_dim × float16 per sample).

Combine this with `kl_reg_weight=0` to skip the base forward pass as well, which reduces each training iteration from 3 forward passes (teacher, base, student) to 1 (student only).

## When this applies

Any distillation training where the teacher signal is deterministic and constant across gradient steps. It does NOT apply if the teacher is updated during training (e.g., online distillation).
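
As a rough check on the figures under Impact, here is a minimal, framework-free sketch of the cache pattern and its bookkeeping. The helper names (`precompute_cache`, `redundant_passes`, `cache_bytes`) are hypothetical and not part of the training code. The mean response length of 139 tokens and the hidden size of 4096 are illustrative assumptions chosen only to show the arithmetic, not values taken from the actual run.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

S = TypeVar("S")
T = TypeVar("T")


def precompute_cache(
    samples: Sequence[Optional[S]], compute: Callable[[S], T]
) -> List[Optional[T]]:
    """Call `compute` once per non-None sample and store results by index."""
    cache: List[Optional[T]] = [None] * len(samples)
    for i, sample in enumerate(samples):
        if sample is None:
            continue  # mirror the `tok is None` skip in the training code
        cache[i] = compute(sample)
    return cache


def redundant_passes(n_samples: int, steps_per_batch: int) -> int:
    """Teacher forward passes saved by caching: all but the first per sample."""
    return n_samples * (steps_per_batch - 1)


def cache_bytes(
    n_samples: int,
    mean_response_len: int,  # assumption: average response length in tokens
    hidden_dim: int,         # assumption: model hidden size
    bytes_per_elem: int = 2,  # float16
) -> int:
    """Approximate CPU memory needed for the cached hidden states."""
    return n_samples * mean_response_len * hidden_dim * bytes_per_elem


# Stand-in for the teacher forward pass that records how often it is called.
calls: List[int] = []


def fake_teacher(x: int) -> int:
    calls.append(x)
    return x * 2


samples = [1, None, 3]
cache = precompute_cache(samples, fake_teacher)

# Simulate 8 training steps that only read from the cache.
for _step in range(8):
    for i, sample in enumerate(samples):
        if sample is None:
            continue
        _teacher_hidden = cache[i]

saved = redundant_passes(387, 8)
approx_mb = cache_bytes(387, 139, 4096) / 1e6
```

Here `fake_teacher` runs twice in total, once per valid sample, regardless of the number of steps. With the assumed dimensions the estimate comes out close to the ~440 MB quoted above. Substitute the real response lengths and hidden size to size the cache for a specific model.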