A Multi-Model Scaling Analysis with Uncertainty Quantification for On-Policy Self-Distillation
Based on Zhao et al. (arXiv: 2601.18734, Jan 2026)

| Model | Parameters (k) | Chi-squared | AIC | BIC | Weight | Pred. at 70B (pp) |
|---|---|---|---|---|---|---|
| Power Law | 2 | 0.55 | 4.55 | 3.76 | 0.338 | 32.9 ± 6.0 |
| Saturating | 2 | 0.73 | 4.73 | 3.95 | 0.309 | 9.1 ± 1.2 |
| Sigmoid | 3 | 0.02 | 6.02 | 4.85 | 0.162 | 15.7 ± 9.7 |
| Sqrt-Log | 3 | 0.06 | 6.06 | 4.88 | 0.159 | 17.2 ± 2.8 |
| Logarithmic | 2 | 5.27 | 9.27 | 8.49 | 0.032 | 12.2 ± 0.9 |
| Model Averaged | - | - | - | - | 1.000 | 19.6 ± 11.3 |
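
The AIC, BIC, and Weight columns can be recomputed from the chi-squared and k columns. The sketch below assumes AIC = χ² + 2k, BIC = χ² + k ln n, and standard Akaike weights; the sample size n = 5 is not stated in the source and is inferred here from the BIC column. It also assumes the model-averaged uncertainty is the standard deviation of the weighted mixture of per-model predictions. Under these assumptions the script matches the table to within rounding.

```python
import math

# Values copied from the model-comparison table:
# name: (k, chi-squared, prediction at 70B in pp, prediction uncertainty in pp)
models = {
    "Power Law":   (2, 0.55, 32.9, 6.0),
    "Saturating":  (2, 0.73, 9.1, 1.2),
    "Sigmoid":     (3, 0.02, 15.7, 9.7),
    "Sqrt-Log":    (3, 0.06, 17.2, 2.8),
    "Logarithmic": (2, 5.27, 12.2, 0.9),
}

N_POINTS = 5  # ASSUMPTION: inferred from the BIC column, not stated in the source


def information_criteria(k, chi2, n=N_POINTS):
    """Return (AIC, BIC), treating chi-squared as the -2 log-likelihood term."""
    return chi2 + 2 * k, chi2 + k * math.log(n)


def akaike_weights(aic_values):
    """Normalize exp(-delta_AIC / 2) so that the weights sum to one."""
    best = min(aic_values)
    raw = [math.exp(-(a - best) / 2) for a in aic_values]
    total = sum(raw)
    return [r / total for r in raw]


aic = {m: information_criteria(k, chi2)[0] for m, (k, chi2, _, _) in models.items()}
bic = {m: information_criteria(k, chi2)[1] for m, (k, chi2, _, _) in models.items()}
weights = dict(zip(models, akaike_weights(list(aic.values()))))

# Model-averaged prediction: weighted mean of the per-model predictions.
avg = sum(weights[m] * models[m][2] for m in models)

# ASSUMPTION: total variance = within-model variance + between-model spread.
var = sum(
    weights[m] * (models[m][3] ** 2 + (models[m][2] - avg) ** 2) for m in models
)
avg_sd = math.sqrt(var)

for m in models:
    print(f"{m:12s} AIC={aic[m]:.2f} BIC={bic[m]:.2f} weight={weights[m]:.3f}")
print(f"Model averaged: {avg:.1f} +/- {avg_sd:.1f} pp")
```

The between-model term dominates the averaged uncertainty: the Power Law and Saturating fits carry similar weight but predict 32.9 pp and 9.1 pp respectively.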

The OPSD gain is decomposed into three mechanistic components: distribution match (the on-policy advantage), dark knowledge transfer, and implicit regularization.

| Component | Parameter | Value | Interpretation |
|---|---|---|---|
| Distribution Match | α | 0.232 | Scaling coefficient |
| | β | 0.950 | Nearly linear growth |
| Dark Knowledge | γ | 0.100 | Max gain (pp) |
| | N_char | 11.5B | Saturation scale |
| Regularization | δ | 3.080 | Max regularization benefit |
| | η | 0.833 | Growth rate |

Bootstrap analysis (1,000 resamples) quantifies the combined uncertainty from data noise, model parameters, and model selection.
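
As a rough illustration of the resampling step, the following is a minimal residual-bootstrap sketch for a single candidate (a power law fitted in log-log space). The source does not give the underlying measurements or say which bootstrap variant was used, so the `sizes` and `gains` arrays are made-up placeholders and the function names are hypothetical. Capturing model-selection uncertainty as well would require refitting every candidate and recomputing the weights inside each resample, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# PLACEHOLDER data, NOT from the source: five hypothetical model sizes
# (billions of parameters) and made-up gains (pp), used only to run the code.
sizes = np.array([0.5, 1.5, 3.0, 7.0, 14.0])
gains = np.array([1.0, 2.1, 3.2, 5.4, 8.1])


def fit_power_law(n, g):
    """Fit g = a * n**b by least squares in log-log space; return (a, b)."""
    b, log_a = np.polyfit(np.log(n), np.log(g), 1)
    return np.exp(log_a), b


def bootstrap_prediction(n, g, target, n_resamples=1000):
    """Residual bootstrap of the power-law prediction at size `target`."""
    a, b = fit_power_law(n, g)
    fitted = np.log(a) + b * np.log(n)   # fitted values in log space
    residuals = np.log(g) - fitted       # log-space residuals
    preds = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample residuals with replacement and rebuild a synthetic data set.
        noisy = fitted + rng.choice(residuals, size=residuals.size, replace=True)
        a_i, b_i = fit_power_law(n, np.exp(noisy))
        preds[i] = a_i * target ** b_i
    return preds


preds = bootstrap_prediction(sizes, gains, target=70.0)
low, high = np.percentile(preds, [2.5, 97.5])
print(f"95% interval at 70B: [{low:.1f}, {high:.1f}] pp (placeholder data)")
```

Resampling residuals rather than data points keeps all five sizes in every resample, which avoids degenerate fits when the data set is this small.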

Information-theoretic analysis identifies the model sizes at which new measurements would best discriminate between the candidate scaling regimes.