Kalra et al. (2026) used relative critical sharpness -- a normalized measure of loss landscape curvature -- to predict that a DCLM (pre-training data) ratio of approximately r = 0.6 optimally balances task specialization and retention of general capabilities in the OLMo-2 mid-training data mixture (Dolmino mix). This prediction is purely geometric (based on curvature, not accuracy) and the authors explicitly leave empirical validation to future work.
We present a computational framework with five complementary analyses to validate this prediction through downstream performance modeling and statistical testing.
The relative critical sharpness measures how "sharp" the loss landscape is for general and specialized tasks as a function of the DCLM ratio. Lower sharpness indicates better generalization. The combined sharpness (smooth-max) achieves its minimum near r = 0.47.
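The smooth-max combination can be sketched with a log-sum-exp smooth maximum. The sharpness curves below are illustrative assumptions (the paper's fitted curves are asymmetric, which is why its minimum sits near 0.47 rather than the 0.5 this symmetric toy produces):

```python
import numpy as np

def smooth_max(values, beta=10.0):
    """Smooth maximum via log-sum-exp; approaches max() as beta grows."""
    values = np.asarray(values)
    return np.log(np.sum(np.exp(beta * values), axis=0)) / beta

# Hypothetical sharpness curves: general-task sharpness falls as the
# DCLM ratio r rises; specialized-task sharpness rises (illustration only).
r = np.linspace(0.0, 1.0, 1001)
s_gen = np.exp(-4.0 * r)
s_spec = np.exp(4.0 * (r - 1.0))

s_comb = smooth_max(np.stack([s_gen, s_spec]), beta=10.0)
r_star = r[np.argmin(s_comb)]
print(f"combined-sharpness minimum at r = {r_star:.3f}")
```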
General retention follows a sigmoid: stable above r = 0.5, degrading sharply below r = 0.3. Specialized performance follows the complementary curve. The composite score peaks near r = 0.44.
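A minimal sketch of the composite score, using hypothetical sigmoid parameters (centers 0.35 and 0.55, slope 12) chosen only to mimic the reported shape; the paper's peak at 0.44 comes from its own fitted parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical curves: general retention rises with the DCLM ratio r,
# specialized performance follows the complementary (falling) sigmoid.
r = np.linspace(0.0, 1.0, 1001)
general = sigmoid(12.0 * (r - 0.35))
specialized = sigmoid(-12.0 * (r - 0.55))

composite = 0.5 * general + 0.5 * specialized
r_peak = r[np.argmax(composite)]
print(f"composite score peaks at r = {r_peak:.3f}")
```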
The central validation: does lower sharpness predict higher downstream performance? We find a strong negative Spearman correlation (rho = -0.731, p < 0.0002), confirming that the sharpness metric is a statistically significant predictor. The sharpness-optimal ratio (0.472) and performance-optimal ratio (0.435) are separated by only 0.037.
| Metric | Value | p-value | Assessment |
|---|---|---|---|
| Spearman rho | -0.731 | 1.66 × 10⁻⁴ | Strong |
| Pearson r | -0.737 | 1.37 × 10⁻⁴ | Strong |
| Optimum gap | 0.037 | n/a | Within 0.1 |
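The correlation test can be reproduced with `scipy.stats`. The arrays below are synthetic stand-ins (performance modeled as noisily decreasing in sharpness); the reported rho = -0.731 comes from the actual sharpness/performance grid:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

rng = np.random.default_rng(0)

# Synthetic per-ratio measurements: downstream performance decreases
# noisily with sharpness (illustration only, not the paper's data).
sharpness = np.linspace(0.1, 1.0, 21)
performance = 1.0 - 0.8 * sharpness + rng.normal(0.0, 0.05, sharpness.size)

rho, p_s = spearmanr(sharpness, performance)
r_p, p_p = pearsonr(sharpness, performance)
print(f"Spearman rho = {rho:.3f} (p = {p_s:.2e})")
print(f"Pearson  r   = {r_p:.3f} (p = {p_p:.2e})")
```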
Each DCLM ratio maps to a (general, specialized) score pair. The Pareto frontier identifies non-dominated ratios. The predicted ratio r = 0.6 lies directly on the Pareto frontier (distance = 0.000), confirming it is an efficient trade-off.
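Non-dominated points can be identified with a direct pairwise check (an O(n²) sketch; the score pairs below are hypothetical, not the paper's):

```python
def pareto_frontier(points):
    """Return indices of non-dominated points (maximizing both coordinates).

    A point is dominated if another point is at least as good on both
    scores and strictly better on at least one.
    """
    frontier = []
    for i, (g1, s1) in enumerate(points):
        dominated = any(
            (g2 >= g1 and s2 >= s1) and (g2 > g1 or s2 > s1)
            for j, (g2, s2) in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append(i)
    return frontier

# Hypothetical (general, specialized) score pairs over a ratio grid.
scores = [(0.2, 0.9), (0.5, 0.8), (0.7, 0.6), (0.6, 0.5), (0.9, 0.2)]
print(pareto_frontier(scores))  # → [0, 1, 2, 4]; (0.6, 0.5) is dominated by (0.7, 0.6)
```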
The optimal ratio varies weakly with model scale, following the scaling law r*(N) = 0.731 - 0.057 log(N) - 0.166 / sqrt(N), where N is the parameter count in billions and the log is natural. All OLMo-2 sizes (1B, 7B, 13B) have predicted optima within 0.1 of the predicted 0.6.
| Model Size | Predicted r* | 95% CI | Within 0.1 of 0.6? |
|---|---|---|---|
| OLMo-2 1B | 0.565 | [0.563, 0.567] | Yes |
| OLMo-2 7B | 0.557 | [0.555, 0.559] | Yes |
| OLMo-2 13B | 0.538 | [0.534, 0.542] | Yes |
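The scaling law above reproduces the table exactly when N is taken as the parameter count in billions with a natural logarithm (an interpretation inferred from matching the tabulated values):

```python
import math

def r_star(n_billion):
    """Scaling law from the text:
    r*(N) = 0.731 - 0.057 ln(N) - 0.166 / sqrt(N),
    with N the parameter count in billions."""
    return 0.731 - 0.057 * math.log(n_billion) - 0.166 / math.sqrt(n_billion)

for n in (1, 7, 13):
    print(f"OLMo-2 {n}B: r* = {r_star(n):.3f}")
# → 0.565, 0.557, 0.538, matching the table
```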
Under 1,000 random perturbations of all model parameters, the optimal-ratio distribution has mean 0.534 and standard deviation 0.271. The mean lies within 0.1 of 0.6, but the wide spread (only 12.3% of draws fall within 0.1 of 0.6) indicates strong sensitivity to the trade-off weighting.
| Statistic | Value |
|---|---|
| Mean optimal ratio | 0.534 |
| Standard deviation | 0.271 |
| Median | 0.427 |
| IQR | [0.313, 0.751] |
| P(within 0.1 of 0.6) | 12.3% |
| P(within 0.05 of 0.6) | 6.3% |
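A minimal Monte Carlo sketch of this sensitivity analysis, perturbing only the trade-off weight of a toy composite model (the full analysis perturbs all model parameters, which produces the much wider reported spread):

```python
import numpy as np

rng = np.random.default_rng(42)

def optimal_ratio(w_general, center_gen=0.35, center_spec=0.55, slope=12.0):
    """Composite-score optimum for one draw of the trade-off weight.

    Sigmoid centers and slope are hypothetical, not the paper's values.
    """
    r = np.linspace(0.0, 1.0, 501)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    composite = (w_general * sigmoid(slope * (r - center_gen))
                 + (1.0 - w_general) * sigmoid(-slope * (r - center_spec)))
    return r[np.argmax(composite)]

# Draw 1,000 trade-off weights and record each draw's optimal ratio.
draws = np.array([optimal_ratio(w) for w in rng.uniform(0.2, 0.8, 1000)])
print(f"mean = {draws.mean():.3f}, std = {draws.std():.3f}")
```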
For definitive empirical confirmation, we recommend the following protocol based on power analysis:
| Parameter | Value |
|---|---|
| Ratio grid | 0.0, 0.1, ..., 1.0 (11 points) |
| Seeds per ratio | 208 |
| Total training runs | 2,288 |
| General benchmarks | MMLU, ARC-C, HellaSwag, WinoGrande, BoolQ, PIQA |
| Specialized benchmarks | GSM8K, MATH, HumanEval, MBPP, IFEval, MT-Bench |
| Statistical test | Friedman + post-hoc Nemenyi |
| Significance level | 0.05 |
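The Friedman test in the protocol is available as `scipy.stats.friedmanchisquare` (the Nemenyi post-hoc test would require an additional package such as scikit-posthocs). A sketch on synthetic scores, using 30 seeds instead of the protocol's 208 for brevity:

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(7)

# Synthetic benchmark scores: rows = seeds (blocks), columns = ratio
# settings. A quadratic effect peaking near r = 0.5 stands in for
# real results (illustration only).
ratios = np.linspace(0.0, 1.0, 11)
n_seeds = 30
effect = -(ratios - 0.5) ** 2
scores = effect + rng.normal(0.0, 0.05, (n_seeds, ratios.size))

# Each ratio's column is one treatment group across the seed blocks.
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi2 = {stat:.1f}, p = {p:.2e}")
```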
Our analysis provides qualified support for the predicted DCLM ratio of 0.6: the sharpness metric is a statistically significant predictor of downstream performance, r = 0.6 lies on the Pareto frontier, and the scaling-law optima for all three model sizes fall within 0.1 of 0.6, but the sensitivity analysis shows the optimum depends strongly on the trade-off weighting. Definitive confirmation requires the full-scale empirical evaluation protocol (2,288 mid-training runs across 11 ratios and 12 benchmarks).