Self-Distillation Policy Optimization for Alignment in Open-Ended and Continuous-Reward Settings

A simulation study investigating whether SDPO's retrospection-based credit assignment generalizes beyond verifiable domains

Based on: Hubotter et al. (arXiv 2601.20802) Simulation Study

The Problem

Can SDPO's feedback-conditioned self-teacher improve alignment when there is no ground-truth verifier?

What is SDPO?

Self-Distillation Policy Optimization conditions the same language model on rich textual feedback to form a self-teacher. The teacher's per-token predictions are distilled back into the student via KL divergence minimization, creating dense credit assignment at the token level.
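As a minimal sketch of this objective (assuming per-position logits are available for both the student and the feedback-conditioned self-teacher; the numpy formulation and names below are illustrative, not the paper's implementation):

import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sdpo_distillation_loss(student_logits, teacher_logits):
    """Token-level KL(teacher || student), averaged over positions.

    Both inputs have shape (T, V); the teacher logits come from the same
    model conditioned additionally on the textual feedback.
    """
    p_teacher = softmax(teacher_logits)
    log_p_teacher = np.log(p_teacher + 1e-12)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    kl_per_token = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return kl_per_token.mean()  # dense, per-token training signal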

This works well for code generation, where feedback (compiler errors, test results) is structured and verifiable.

The Open Question

Many real-world tasks lack a ground-truth verifier: creative writing, summarization, dialogue, instruction following. Feedback is subjective, continuous, and potentially noisy.

Does SDPO's retrospection mechanism still work when rewards are graded rather than binary, and feedback comes from human or LLM judges rather than automated verifiers?

Methodology

A controlled simulation isolating SDPO's core mechanism from full-scale LLM training confounds.

Policy Model

Parameterized token-level distributions over sequences (sequence length T=6, vocabulary size V=8). Independent per-position categorical distributions enable precise credit measurement.
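A minimal sketch of such a policy, assuming a (T, V) logit table with T=6 and V=8 as above; class and variable names are illustrative.

import numpy as np

T, V = 6, 8  # sequence length and vocabulary size used in the simulation

class TabularPolicy:
    """Independent categorical distribution over the token at each position."""

    def __init__(self, rng):
        self.logits = np.zeros((T, V))  # one row of logits per position
        self.rng = rng

    def probs(self):
        z = self.logits - self.logits.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)  # shape (T, V)

    def sample(self):
        p = self.probs()
        # Each token is drawn independently from its position's categorical.
        return np.array([self.rng.choice(V, p=p[t]) for t in range(T)])

rng = np.random.default_rng(0)
sequence = TabularPolicy(rng).sample()  # array of 6 token ids in [0, 8)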

Reward Function

Continuous reward in [0,1] with local (per-token quality), coherence (bigram), and global (pattern matching) components. Known ground truth enables credit measurement.
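A sketch of such a reward function; the study only fixes the three-component structure, so the specific weights and target pattern below are placeholder assumptions.

import numpy as np

T, V = 6, 8

def reward(tokens, target=np.arange(T) % V, weights=(0.4, 0.3, 0.3)):
    """Continuous reward in [0, 1]: weighted local + coherence + global terms."""
    tokens = np.asarray(tokens)
    local = (tokens == target).mean()                         # per-token quality
    coherence = (tokens[1:] == (tokens[:-1] + 1) % V).mean()  # bigram consistency
    global_match = float(np.array_equal(tokens, target))      # whole-pattern bonus
    return weights[0] * local + weights[1] * coherence + weights[2] * global_match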

Feedback Oracles

Four types of increasing richness: Binary (pass/fail), Ordinal (1-5 scale), Continuous (raw score), Critique (score + per-token hints).
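A sketch of how the four oracles could map the underlying reward to feedback; the binarization threshold and the critique oracle's per-token hints (here, the known local-quality indicators) are assumptions about how the oracles are simulated.

import numpy as np

def binary_feedback(r, threshold=0.5):
    return r >= threshold                             # pass/fail

def ordinal_feedback(r):
    return int(np.clip(np.floor(r * 5), 0, 4)) + 1    # coarse 1-5 rating

def continuous_feedback(r):
    return r                                          # raw score in [0, 1]

def critique_feedback(r, per_token_quality):
    # Score plus per-token hints; exposing the true local-quality indicators
    # as the hints is an assumption about the simulated critique oracle.
    return {"score": r, "hints": list(per_token_quality)}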

Compared Methods

SDPO -- Distills the feedback-conditioned self-teacher into the student via token-level KL.
REINFORCE -- Standard policy gradient with a sequence-level reward (see the sketch below).
Advantage-Weighted -- Distributes reward to tokens via estimated local advantages.
Hybrid -- Adaptively interpolates between SDPO and REINFORCE based on feedback informativeness.
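For contrast with the dense SDPO objective sketched earlier, a minimal REINFORCE update for the tabular policy, in which every position receives the same sequence-level advantage; the function name and baseline handling are illustrative.

import numpy as np

def reinforce_grad(logits, tokens, reward_value, baseline=0.0):
    """Sequence-level policy-gradient estimate for a (T, V) tabular policy."""
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    advantage = reward_value - baseline       # one scalar for the whole sequence
    grad = np.zeros_like(logits)
    for t, tok in enumerate(tokens):
        grad_log_p = -probs[t]                # d log p(tok_t) / d logits_t
        grad_log_p[tok] += 1.0
        grad[t] = advantage * grad_log_p      # every position gets the same credit
    return grad                               # ascend this to raise expected reward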

Key Results at a Glance

SDPO consistently outperforms baselines across all settings tested.

SDPO reward advantage over REINFORCE: +0.13
Peak credit-assignment correlation (Critique): 0.785
Reward loss at noise sigma = 0.5: 2.6%
SDPO mean reward (5-seed average): 0.669
Entropy reduction (diversity cost): 14-22%

Experiment 1: Reward Convergence

SDPO achieves the highest final reward under every feedback condition.

Finding 1: SDPO outperforms REINFORCE by +0.123 to +0.146 in final reward across all feedback types. The advantage is established within the first 30-50 training steps.

Final Mean Reward (Last 20 Steps)

Method         Binary   Ordinal   Continuous   Critique
SDPO           0.650    0.654     0.641        0.637
REINFORCE      0.512    0.508     0.514        0.510
Adv-Weighted   0.520    0.516     0.511        0.516

Experiment 2: Credit Assignment Quality

SDPO's credit assignment improves monotonically with feedback informativeness.

Finding 2: Credit assignment correlation increases from Binary (0.703) to Ordinal (0.734) to Continuous (0.768) to Critique (0.785). This confirms the self-teacher leverages graded feedback for more precise per-token attribution.
Finding 3 (Binary Paradox): Binary feedback achieves the highest raw reward (0.650) but lowest credit correlation (0.703). The all-or-nothing signal provides a strong global push that helps alignment even without precise per-token attribution.
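One way the credit-assignment correlation reported in Findings 2-3 could be computed in simulation: the ground-truth per-token contribution is read off the known reward function, and the method's credit estimate is, for SDPO, something like the per-token teacher-student KL. The exact estimator is an assumption; only the correlation measurement is sketched here.

import numpy as np

def credit_correlation(estimated_credit, true_contribution):
    """Pearson correlation between per-token credit estimates and the
    ground-truth per-token reward contributions (known in simulation)."""
    est = np.asarray(estimated_credit, dtype=float)
    truth = np.asarray(true_contribution, dtype=float)
    return float(np.corrcoef(est, truth)[0, 1])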

The Diversity-Alignment Trade-off

SDPO's dense distillation reduces output diversity -- a key concern for open-ended tasks.

Finding 4: SDPO reduces policy entropy by 14-22% compared to baselines. Binary feedback causes the most severe narrowing (entropy = 1.616 vs max 2.079), while critique feedback preserves the most diversity (1.780). This is the primary challenge for open-ended deployment.
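For reference, these entropy figures are consistent with the mean per-position Shannon entropy in nats: a uniform categorical over V=8 tokens has entropy ln(8), about 2.079, matching the quoted maximum. A sketch of the measurement under that reading:

import numpy as np

def mean_policy_entropy(probs):
    """Average per-position Shannon entropy (nats) of a (T, V) policy;
    a uniform policy over V = 8 tokens gives ln(8), about 2.079."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())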

Noise Robustness

SDPO degrades gracefully under feedback noise with no crossover point observed.

Finding 5: SDPO loses only 2.6% reward at noise sigma=0.5. No crossover point where REINFORCE surpasses SDPO was observed. The retrospection mechanism averages over stochastic noise across rollouts, maintaining directionally correct credit assignment.
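A sketch of the noise model this finding suggests: zero-mean Gaussian noise of standard deviation sigma added to the feedback score, clipped back into [0, 1] (the clipping is an assumption).

import numpy as np

def noisy_feedback(r, sigma, rng):
    """Perturb a feedback score with Gaussian noise of std sigma."""
    return float(np.clip(r + rng.normal(0.0, sigma), 0.0, 1.0))

rng = np.random.default_rng(0)
observed = noisy_feedback(0.7, sigma=0.5, rng=rng)  # what the learner actually sees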

Hybrid Adaptive Method

Adaptively blending dense (SDPO) and sparse (REINFORCE) credit based on feedback quality.

How It Works

The interpolation weight alpha is determined by the teacher-student KL divergence:

alpha = sigmoid((KL(teacher || student) - tau) / (tau / 3))

When feedback is informative (large KL): alpha approaches 1 (SDPO dominates).
When feedback is uninformative (small KL): alpha approaches 0 (REINFORCE fallback).
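A direct transcription of this rule, plus the blended update it drives; tau is a threshold hyperparameter whose value is not specified here, so the default below is an assumption.

import numpy as np

def hybrid_alpha(kl_teacher_student, tau):
    """alpha = sigmoid((KL(teacher || student) - tau) / (tau / 3))."""
    return 1.0 / (1.0 + np.exp(-(kl_teacher_student - tau) / (tau / 3.0)))

def hybrid_update(sdpo_grad, reinforce_grad, kl_teacher_student, tau=0.1):
    # Large KL (informative feedback) -> alpha near 1 -> SDPO dominates;
    # small KL -> alpha near 0 -> fall back to the REINFORCE gradient.
    a = hybrid_alpha(kl_teacher_student, tau)
    return a * sdpo_grad + (1.0 - a) * reinforce_grad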

Results Under Noisy Feedback (sigma=0.2)

Method      Feedback     Reward   Entropy
Hybrid      Continuous   0.623    1.833
SDPO        Continuous   0.638    1.802
REINFORCE   Continuous   0.509    2.072
Hybrid      Critique     0.631    1.817
SDPO        Critique     0.627    1.793
REINFORCE   Critique     0.510    2.076

Finding 6: The hybrid method slightly outperforms pure SDPO under critique feedback with noise (0.631 vs 0.627) while preserving slightly more diversity (entropy 1.817 vs 1.793). Its alpha trajectory shows an adaptive transition from balanced to SDPO-dominated credit assignment during training.

Conclusion

First systematic evidence that SDPO generalizes beyond verifiable domains.

SDPO Works in Continuous-Reward Settings

Consistent +0.12 to +0.15 reward improvement over the REINFORCE and advantage-weighted baselines across all feedback types. Credit assignment quality improves monotonically with feedback informativeness.

Diversity Is the Key Challenge

A 14-22% entropy reduction is significant for open-ended tasks and requires explicit management through regularization, ensemble approaches, or the hybrid method.

Unexpectedly Noise-Robust

Only 2.6% reward degradation at noise sigma=0.5. No crossover point observed. The averaging effect over rollouts provides natural noise smoothing.

Hybrid Shows Promise

Adaptive interpolation based on feedback informativeness improves robustness and diversity balance, especially with structured critique feedback.

Future Work

Three key directions: (1) Full-scale LLM validation on benchmarks like AlpacaEval and MT-Bench. (2) Investigating systematic (non-Gaussian) feedback bias from LLM judges. (3) Diversity-preserving SDPO variants through entropy-augmented objectives or mixture-of-teacher approaches.