Three complementary theoretical frameworks establishing rigorous guarantees for the SDFT in-context assumption
Self-Distillation Fine-Tuning (SDFT) assumes that conditioning a foundation model on an expert demonstration produces a teacher policy that approximates the optimal next policy under a trust-region-regularized objective. The trust-region optimal policy is:
Two conditions must hold: Claim A (near-optimality in reward) and Claim B (minimal KL deviation from the current policy). The SDFT paper states these conditions "cannot be verified theoretically." We provide the first formal justification.
Under an exponential family model of the pretraining distribution, the demonstration-conditioned policy converges to the trust-region optimal:
where d is the sufficient statistic dimension, λ0 is prior precision, and n is the number of demonstrations.
With probability ≥ 1 - δ over the demonstration:
Distribution-free bound holding for any bounded reward function.
The reward gap and KL excess decompose exactly:
Both SDFT claims follow from bounding the single quantity KL(πdemo | π*).
Adjust the trust-region coefficient β to see how different teacher policies compare.
The variational gap is primarily governed by the ICL approximation quality (σ) rather than the trust-region coefficient (β). For σ ≤ 0.1, the gap remains below 0.01 across all β values.
| n | KL(demo|opt) | Theory | Ratio | Reward Gap | KL Excess |
|---|---|---|---|---|---|
| 1 | 0.2775 | 1.2500 | 0.22 | 0.0346 | 0.2429 |
| 3 | 0.0859 | 0.6250 | 0.14 | -0.0930 | 0.1788 |
| 8 | 0.0815 | 0.2778 | 0.29 | -0.1132 | 0.1947 |
| 20 | 0.0831 | 0.1190 | 0.70 | -0.1144 | 0.1975 |
| 50 | 0.0834 | 0.0490 | 1.70 | -0.1146 | 0.1981 |
| 100 | 0.0836 | 0.0248 | 3.37 | -0.1148 | 0.1984 |
| 200 | 0.0837 | 0.0124 | 6.73 | -0.1148 | 0.1985 |
| neff | Bound | Actual Gap | Tightness | Violation |
|---|---|---|---|---|
| 3 | 0.8962 | 0.0129 | 0.014 | 0.1% |
| 8 | 0.5713 | 0.0052 | 0.009 | 0.0% |
| 15 | 0.4283 | 0.0029 | 0.007 | 0.0% |
| 50 | 0.2049 | -0.0004 | -0.002 | 0.0% |
| 100 | 0.1487 | 0.0004 | 0.003 | 0.0% |
| 200 | 0.1080 | 0.0002 | 0.002 | 0.0% |
| 500 | 0.0851 | 0.0004 | 0.004 | 0.0% |
We provided the first rigorous theoretical justification for the SDFT in-context assumption through three complementary frameworks:
The variational gap KL(πdemo | π*) emerges as the single key quantity: bounding it simultaneously establishes near-optimality and minimal deviation. The ICL-conditioned teacher achieves 91.5% of optimal trust-region value with only 0.042 nats KL distance to π*.