A simulation study investigating whether SDPO's retrospection-based credit assignment generalizes beyond verifiable domains
Can SDPO's feedback-conditioned self-teacher improve alignment when there is no ground-truth verifier?
Self-Distillation Policy Optimization conditions the same language model on rich textual feedback to form a self-teacher. The teacher's per-token predictions are distilled back into the student via KL divergence minimization, creating dense credit assignment at the token level.
This works well for code generation, where feedback (compiler errors, test results) is structured and verifiable.
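A minimal PyTorch sketch of the core distillation step, assuming the student and the feedback-conditioned self-teacher each expose per-token logits over the vocabulary; the function and tensor names are illustrative, not SDPO's reference implementation.

```python
import torch
import torch.nn.functional as F

def sdpo_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level KL(teacher || student), averaged over positions.

    student_logits: (T, V) logits from the student policy.
    teacher_logits: (T, V) logits from the same model conditioned on the
                    rollout plus its textual feedback (the self-teacher).
    """
    teacher_log_probs = F.log_softmax(teacher_logits.detach(), dim=-1)  # teacher is not updated
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = teacher_log_probs.exp()
    # Dense per-token credit: every position contributes its own KL term.
    kl_per_token = (teacher_probs * (teacher_log_probs - student_log_probs)).sum(dim=-1)
    return kl_per_token.mean()
```

Because the KL is computed per position, every token receives its own gradient signal, even when the scalar reward is identical across rollouts.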
Many real-world tasks lack a ground-truth verifier: creative writing, summarization, dialogue, instruction following. Feedback is subjective, continuous, and potentially noisy.
Does SDPO's retrospection mechanism still work when rewards are graded rather than binary, and feedback comes from human or LLM judges rather than automated verifiers?
A controlled simulation that isolates SDPO's core mechanism from the confounds of full-scale LLM training.
Parameterized token-level distributions over sequences (length T=6, vocabulary size V=8). Independent per-position categorical distributions enable precise measurement of credit assignment.
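A sketch of this toy policy class, assuming PyTorch: a single (T, V) table of logits defines one independent categorical per position. The helper name and sampling interface are illustrative.

```python
import torch
from torch.distributions import Categorical

T, V = 6, 8                                        # sequence length, vocabulary size
logits = torch.zeros(T, V, requires_grad=True)     # one independent categorical per position

def sample_sequences(n_rollouts: int):
    """Sample rollouts and their sequence log-probabilities under the current policy."""
    dist = Categorical(logits=logits)              # batch of T categoricals
    tokens = dist.sample((n_rollouts,))            # (n_rollouts, T) integer tokens
    seq_log_probs = dist.log_prob(tokens).sum(-1)  # (n_rollouts,) log p(sequence)
    return tokens, seq_log_probs
```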
Continuous reward in [0,1] combining local (per-token quality), coherence (bigram), and global (pattern-matching) components. The known ground truth makes credit assignment directly measurable.
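One way such a reward could be composed, reusing T and V from the policy sketch above; the specific target pattern, bigram rule, and component weights are assumptions for illustration, not the study's exact reward.

```python
def reward(tokens, target=None, weights=(0.5, 0.3, 0.2)):
    """Illustrative reward in [0, 1] for a single (T,) token sequence."""
    target = torch.arange(tokens.shape[0]) % V if target is None else target  # hypothetical target pattern
    w_local, w_coh, w_glob = weights
    local = (tokens == target).float().mean()                         # per-token quality
    coherence = (tokens[1:] == (tokens[:-1] + 1) % V).float().mean()  # bigram consistency
    global_match = float(torch.equal(tokens, target))                 # whole-pattern match
    return w_local * local + w_coh * coherence + w_glob * global_match
```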
Four feedback types of increasing richness: Binary (pass/fail), Ordinal (1-5 scale), Continuous (raw score), and Critique (score plus per-token hints).
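A sketch of how these four feedback types might be derived from the underlying continuous reward; the pass/fail threshold, the ordinal binning, and the per-token hint format are assumptions, not the study's exact protocol.

```python
def make_feedback(kind: str, r: float, tokens=None, target=None):
    """Map the underlying continuous reward r in [0, 1] to one feedback type."""
    if kind == "binary":
        return {"passed": r >= 0.5}                 # pass/fail (threshold is an assumption)
    if kind == "ordinal":
        return {"score": min(int(r * 5) + 1, 5)}    # 1-5 scale via uniform binning
    if kind == "continuous":
        return {"score": float(r)}                  # raw score
    if kind == "critique":
        hints = (tokens == target).tolist()         # hypothetical per-token hints: which positions look right
        return {"score": float(r), "token_hints": hints}
    raise ValueError(f"unknown feedback type: {kind}")
```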
- SDPO: distills the feedback-conditioned self-teacher into the student via token-level KL.
- REINFORCE: standard policy gradient with a sequence-level reward (sketched after this list).
- Advantage-Weighted: distributes reward to tokens via estimated local advantages.
- Hybrid: adaptively interpolates between SDPO and REINFORCE based on feedback informativeness.
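For contrast with the dense distillation loss sketched earlier, a minimal REINFORCE step on the same toy policy, reusing the hypothetical `sample_sequences` and `reward` helpers above; the mean-reward baseline and learning rate are illustrative choices.

```python
optimizer = torch.optim.Adam([logits], lr=0.1)          # the (T, V) logits from the policy sketch

def reinforce_step(n_rollouts: int = 64):
    """One REINFORCE update: a single sequence-level reward per rollout."""
    tokens, seq_log_probs = sample_sequences(n_rollouts)
    rewards = torch.stack([reward(t) for t in tokens])
    advantages = rewards - rewards.mean()               # mean baseline for variance reduction
    loss = -(advantages.detach() * seq_log_probs).mean()  # score-function (policy gradient) estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```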
SDPO consistently outperforms baselines across all settings tested.
SDPO achieves the highest final reward under every feedback condition.
| Method | Binary | Ordinal | Continuous | Critique |
|---|---|---|---|---|
| SDPO | 0.650 | 0.654 | 0.641 | 0.637 |
| REINFORCE | 0.512 | 0.508 | 0.514 | 0.510 |
| Adv-Weighted | 0.520 | 0.516 | 0.511 | 0.516 |
SDPO's credit assignment improves monotonically with feedback informativeness.
SDPO's dense distillation reduces output diversity -- a key concern for open-ended tasks.
SDPO degrades gracefully under feedback noise with no crossover point observed.
The hybrid method adaptively blends dense (SDPO) and sparse (REINFORCE) credit based on feedback quality.
The interpolation weight alpha is determined by the teacher-student KL divergence:
alpha = sigmoid((KL(teacher || student) - tau) / (tau / 3))
When feedback is informative (large KL): alpha approaches 1 (SDPO dominates).
When feedback is uninformative (small KL): alpha approaches 0 (REINFORCE fallback).
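A sketch of how this gate could be wired up, assuming the teacher-student KL is already available as a tensor; the default `tau` value here is an assumption.

```python
import torch

def hybrid_alpha(kl_teacher_student: torch.Tensor, tau: float) -> torch.Tensor:
    """alpha = sigmoid((KL(teacher || student) - tau) / (tau / 3))."""
    return torch.sigmoid((kl_teacher_student - tau) / (tau / 3.0))

def hybrid_loss(sdpo_loss, reinforce_loss, kl_teacher_student, tau=0.1):
    """Blend dense (SDPO) and sparse (REINFORCE) credit by feedback informativeness."""
    alpha = hybrid_alpha(kl_teacher_student.detach(), tau)   # gate only; no gradient through alpha
    return alpha * sdpo_loss + (1.0 - alpha) * reinforce_loss
```

Detaching the KL before computing alpha keeps the gate from becoming an extra gradient path; it only selects how much weight each objective receives.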
| Method | Feedback | Final reward | Final entropy |
|---|---|---|---|
| Hybrid | Continuous | 0.623 | 1.833 |
| SDPO | Continuous | 0.638 | 1.802 |
| REINFORCE | Continuous | 0.509 | 2.072 |
| Hybrid | Critique | 0.631 | 1.817 |
| SDPO | Critique | 0.627 | 1.793 |
| REINFORCE | Critique | 0.510 | 2.076 |
First systematic evidence that SDPO generalizes beyond verifiable domains.
Consistent +0.12 to +0.18 reward improvement over baselines across all feedback types. Credit assignment quality improves monotonically with feedback informativeness.
The 14-22% entropy reduction is significant for open-ended tasks and requires explicit management through regularization, ensemble approaches, or the hybrid method.
Only 2.6% reward degradation at noise sigma=0.5. No crossover point observed. The averaging effect over rollouts provides natural noise smoothing.
Adaptive interpolation based on feedback informativeness improves robustness and diversity balance, especially with structured critique feedback.
Three key directions: (1) Full-scale LLM validation on benchmarks like AlpacaEval and MT-Bench. (2) Investigating systematic (non-Gaussian) feedback bias from LLM judges. (3) Diversity-preserving SDPO variants through entropy-augmented objectives or mixture-of-teacher approaches.