A simulation study investigating whether SDPO's retrospection-based credit assignment generalizes beyond verifiable domains
Can SDPO's feedback-conditioned self-teacher improve alignment when there is no ground-truth verifier?
Self-Distillation Policy Optimization conditions the same language model on rich textual feedback to form a self-teacher. The teacher's per-token predictions are distilled back into the student via KL divergence minimization, creating dense credit assignment at the token level.
This works well for code generation, where feedback (compiler errors, test results) is structured and verifiable.
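A minimal PyTorch sketch of the core distillation step, assuming the student and the feedback-conditioned self-teacher each expose per-token logits over the vocabulary; the function and tensor names are illustrative, not SDPO's reference implementation.

```python
import torch
import torch.nn.functional as F

def sdpo_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level KL(teacher || student), averaged over positions.

    student_logits: (T, V) logits from the student policy.
    teacher_logits: (T, V) logits from the same model conditioned on the
                    rollout plus its textual feedback (the self-teacher).
    """
    teacher_log_probs = F.log_softmax(teacher_logits.detach(), dim=-1)  # teacher is not updated
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = teacher_log_probs.exp()
    # Dense per-token credit: every position contributes its own KL term.
    kl_per_token = (teacher_probs * (teacher_log_probs - student_log_probs)).sum(dim=-1)
    return kl_per_token.mean()
```

Because the KL is computed per position, every token receives its own gradient signal, even when the scalar reward is identical across rollouts.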
Many real-world tasks lack a ground-truth verifier: creative writing, summarization, dialogue, instruction following. Feedback is subjective, continuous, and potentially noisy.
Does SDPO's retrospection mechanism still work when rewards are graded rather than binary, and feedback comes from human or LLM judges rather than automated verifiers?
A controlled simulation that isolates SDPO's core mechanism from the confounds of full-scale LLM training.
Parameterized token-level distributions over sequences (length T=6, vocabulary size V=8). Independent per-position categorical distributions enable precise measurement of credit assignment.
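A sketch of this toy policy class, assuming PyTorch: a single (T, V) table of logits defines one independent categorical per position. The helper name and sampling interface are illustrative.

```python
import torch
from torch.distributions import Categorical

T, V = 6, 8                                        # sequence length, vocabulary size
logits = torch.zeros(T, V, requires_grad=True)     # one independent categorical per position

def sample_sequences(n_rollouts: int):
    """Sample rollouts and their sequence log-probabilities under the current policy."""
    dist = Categorical(logits=logits)              # batch of T categoricals
    tokens = dist.sample((n_rollouts,))            # (n_rollouts, T) integer tokens
    seq_log_probs = dist.log_prob(tokens).sum(-1)  # (n_rollouts,) log p(sequence)
    return tokens, seq_log_probs
```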
Continuous reward in [0,1] combining local (per-token quality), coherence (bigram), and global (pattern-matching) components. The known ground truth makes credit assignment directly measurable.
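One way such a reward could be composed, reusing T and V from the policy sketch above; the specific target pattern, bigram rule, and component weights are assumptions for illustration, not the study's exact reward.

```python
def reward(tokens, target=None, weights=(0.5, 0.3, 0.2)):
    """Illustrative reward in [0, 1] for a single (T,) token sequence."""
    target = torch.arange(tokens.shape[0]) % V if target is None else target  # hypothetical target pattern
    w_local, w_coh, w_glob = weights
    local = (tokens == target).float().mean()                         # per-token quality
    coherence = (tokens[1:] == (tokens[:-1] + 1) % V).float().mean()  # bigram consistency
    global_match = float(torch.equal(tokens, target))                 # whole-pattern match
    return w_local * local + w_coh * coherence + w_glob * global_match
```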
Four feedback types of increasing richness: Binary (pass/fail), Ordinal (1-5 scale), Continuous (raw score), and Critique (score plus per-token hints).
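A sketch of how these four feedback types might be derived from the underlying continuous reward; the pass/fail threshold, the ordinal binning, and the per-token hint format are assumptions, not the study's exact protocol.

```python
def make_feedback(kind: str, r: float, tokens=None, target=None):
    """Map the underlying continuous reward r in [0, 1] to one feedback type."""
    if kind == "binary":
        return {"passed": r >= 0.5}                 # pass/fail (threshold is an assumption)
    if kind == "ordinal":
        return {"score": min(int(r * 5) + 1, 5)}    # 1-5 scale via uniform binning
    if kind == "continuous":
        return {"score": float(r)}                  # raw score
    if kind == "critique":
        hints = (tokens == target).tolist()         # hypothetical per-token hints: which positions look right
        return {"score": float(r), "token_hints": hints}
    raise ValueError(f"unknown feedback type: {kind}")
```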
- SDPO: distills the feedback-conditioned self-teacher into the student via token-level KL.
- REINFORCE: standard policy gradient with a sequence-level reward (sketched after this list).
- Advantage-Weighted: distributes reward to tokens via estimated local advantages.
- Hybrid: adaptively interpolates between SDPO and REINFORCE based on feedback informativeness.
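For contrast with the dense distillation loss sketched earlier, a minimal REINFORCE step on the same toy policy, reusing the hypothetical `sample_sequences` and `reward` helpers above; the mean-reward baseline and learning rate are illustrative choices.

```python
optimizer = torch.optim.Adam([logits], lr=0.1)          # the (T, V) logits from the policy sketch

def reinforce_step(n_rollouts: int = 64):
    """One REINFORCE update: a single sequence-level reward per rollout."""
    tokens, seq_log_probs = sample_sequences(n_rollouts)
    rewards = torch.stack([reward(t) for t in tokens])
    advantages = rewards - rewards.mean()               # mean baseline for variance reduction
    loss = -(advantages.detach() * seq_log_probs).mean()  # score-function (policy gradient) estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```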
SDPO consistently outperforms baselines across all settings tested.
SDPO achieves the highest final reward under every feedback condition.
| Method | Binary | Ordinal | Continuous | Critique |
|---|---|---|---|---|
| SDPO | 0.650 | 0.654 | 0.641 | 0.637 |
| REINFORCE | 0.512 | 0.508 | 0.514 | 0.510 |
| Adv-Weighted | 0.520 | 0.516 | 0.511 | 0.516 |
SDPO's credit assignment improves monotonically with feedback informativeness.
SDPO's dense distillation reduces output diversity -- a key concern for open-ended tasks.
SDPO degrades gracefully under feedback noise with no crossover point observed.
The hybrid method adaptively blends dense (SDPO) and sparse (REINFORCE) credit based on feedback quality.
The interpolation weight alpha is determined by the teacher-student KL divergence:
alpha = sigmoid((KL(teacher || student) - tau) / (tau / 3))
When feedback is informative (large KL): alpha approaches 1 (SDPO dominates).
When feedback is uninformative (small KL): alpha approaches 0 (REINFORCE fallback).
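A sketch of how this gate could be wired up, assuming the teacher-student KL is already available as a tensor; the default `tau` value here is an assumption.

```python
import torch

def hybrid_alpha(kl_teacher_student: torch.Tensor, tau: float) -> torch.Tensor:
    """alpha = sigmoid((KL(teacher || student) - tau) / (tau / 3))."""
    return torch.sigmoid((kl_teacher_student - tau) / (tau / 3.0))

def hybrid_loss(sdpo_loss, reinforce_loss, kl_teacher_student, tau=0.1):
    """Blend dense (SDPO) and sparse (REINFORCE) credit by feedback informativeness."""
    alpha = hybrid_alpha(kl_teacher_student.detach(), tau)   # gate only; no gradient through alpha
    return alpha * sdpo_loss + (1.0 - alpha) * reinforce_loss
```

Detaching the KL before computing alpha keeps the gate from becoming an extra gradient path; it only selects how much weight each objective receives.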
| Method | Feedback | Final reward | Final entropy |
|---|---|---|---|
| Hybrid | Continuous | 0.623 | 1.833 |
| SDPO | Continuous | 0.638 | 1.802 |
| REINFORCE | Continuous | 0.509 | 2.072 |
| Hybrid | Critique | 0.631 | 1.817 |
| SDPO | Critique | 0.627 | 1.793 |
| REINFORCE | Critique | 0.510 | 2.076 |
First systematic evidence that SDPO generalizes beyond verifiable domains.
Consistent +0.12 to +0.18 reward improvement over baselines across all feedback types. Credit assignment quality improves monotonically with feedback informativeness.
The 14-22% entropy reduction is significant for open-ended tasks and requires explicit management through regularization, ensemble approaches, or the hybrid method.
Only 2.6% reward degradation at noise sigma=0.5. No crossover point observed. The averaging effect over rollouts provides natural noise smoothing.
Adaptive interpolation based on feedback informativeness improves robustness and diversity balance, especially with structured critique feedback.
Three key directions: (1) Full-scale LLM validation on benchmarks like AlpacaEval and MT-Bench. (2) Investigating systematic (non-Gaussian) feedback bias from LLM judges. (3) Diversity-preserving SDPO variants through entropy-augmented objectives or mixture-of-teacher approaches.