Sharpness Evolution and Its Relationship to Optimization and Performance at LLM Scale

This simulation study examines how loss-landscape sharpness evolves during training across model scales (10M to 7B parameters), and how it relates to optimization behavior and downstream task performance.

Scaling Law Slope (sharpness per decade): -0.1055
Scaling Law R-squared: 0.9983
Sharpness-Loss Correlation: 0.9945
Sharpness-Performance Correlation: -0.9992
Scale-Sharpness Correlation: -0.9991

Figures (charts not reproduced in this text): Sharpness Evolution During Training; Sharpness Scaling Law; Training Loss Trajectories; Sharpness vs. Downstream Performance; Sharpness-Gradient Correlation by Scale; Downstream Task Performance by Scale.

Scale Summary

Model  Peak Sharpness  Final Sharpness  Final Loss  Mean Accuracy  Sharp.-Loss r  Sharp.-Grad r
10M    2.0644          1.2785           5.009       0.3616         0.4445         0.9218
125M   2.4108          1.1669           4.1508      0.4532         0.4794         0.9638
350M   2.5996          1.1217           3.8431      0.4843         0.4884         0.9705
1.3B   2.7585          1.0646           3.4863      0.5344         0.5181         0.9795
3B     2.8945          1.0135           3.2753      0.5674         0.5227         0.982
7B     2.9976          0.9804           3.077       0.603          0.5335         0.9849
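
The headline statistics at the top of the report appear to be recoverable from this table. The sketch below assumes (the report does not say so explicitly) that the scaling law is an ordinary least-squares fit of final sharpness against log10 of the parameter count, that the cross-scale correlations are Pearson coefficients over the six models, and that each parameter count is the nominal size in the model name (for example, "1.3B" is taken as 1.3e9). The helper names `pearson` and `ols_slope` are illustrative only.

```python
import math

# Values transcribed from the Scale Summary table.
# Parameter counts are ASSUMED to be the nominal sizes in the model names.
params = [10e6, 125e6, 350e6, 1.3e9, 3e9, 7e9]
final_sharpness = [1.2785, 1.1669, 1.1217, 1.0646, 1.0135, 0.9804]
final_loss = [5.009, 4.1508, 3.8431, 3.4863, 3.2753, 3.077]
mean_accuracy = [0.3616, 0.4532, 0.4843, 0.5344, 0.5674, 0.603]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ols_slope(xs, ys):
    """Slope b of the least-squares line y = a + b * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# One unit of log10(params) is one decade of model scale.
log_params = [math.log10(n) for n in params]

slope = ols_slope(log_params, final_sharpness)   # sharpness per decade
r_scale = pearson(log_params, final_sharpness)   # scale vs. sharpness
r_squared = r_scale ** 2                         # R-squared of the linear fit
r_loss = pearson(final_sharpness, final_loss)    # sharpness vs. final loss
r_perf = pearson(final_sharpness, mean_accuracy) # sharpness vs. mean accuracy

print(f"slope per decade:       {slope:.4f}")
print(f"R-squared:              {r_squared:.4f}")
print(f"sharpness-loss r:       {r_loss:.4f}")
print(f"sharpness-performance r:{r_perf:.4f}")
print(f"scale-sharpness r:      {r_scale:.4f}")
```

Under these assumptions the computed values should land close to the reported figures, which suggests the headline correlations are taken across the six model scales using final (not peak) sharpness. The per-model "Sharp.-Loss r" and "Sharp.-Grad r" columns are a different quantity, presumably computed within each training run, and cannot be reproduced from this table alone.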

Downstream Task Accuracy

Model  HellaSwag  ARC-Easy  PIQA    WinoGrande  LAMBADA
10M    0.3658     0.387     0.4318  0.3266      0.2966
125M   0.4474     0.477     0.5057  0.441       0.3951
350M   0.4743     0.5041    0.5521  0.4598      0.4312
1.3B   0.548      0.5612    0.5921  0.5121      0.4586
3B     0.558      0.6098    0.631   0.5317      0.5064
7B     0.6144     0.6286    0.6508  0.578       0.5432
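
As a consistency check between the two tables, the sketch below assumes that the "Mean Accuracy" column in the Scale Summary is the unweighted mean of the five task accuracies listed here. The report does not state this; the dictionary names are illustrative.

```python
# Per-task accuracies transcribed from the Downstream Task Accuracy table,
# in the order: HellaSwag, ARC-Easy, PIQA, WinoGrande, LAMBADA.
task_accuracy = {
    "10M":  [0.3658, 0.387, 0.4318, 0.3266, 0.2966],
    "125M": [0.4474, 0.477, 0.5057, 0.441, 0.3951],
    "350M": [0.4743, 0.5041, 0.5521, 0.4598, 0.4312],
    "1.3B": [0.548, 0.5612, 0.5921, 0.5121, 0.4586],
    "3B":   [0.558, 0.6098, 0.631, 0.5317, 0.5064],
    "7B":   [0.6144, 0.6286, 0.6508, 0.578, 0.5432],
}

# "Mean Accuracy" column transcribed from the Scale Summary table.
reported_mean = {
    "10M": 0.3616, "125M": 0.4532, "350M": 0.4843,
    "1.3B": 0.5344, "3B": 0.5674, "7B": 0.603,
}

# ASSUMPTION: the reported mean is an unweighted average over the five tasks.
computed_mean = {
    model: sum(scores) / len(scores)
    for model, scores in task_accuracy.items()
}

for model, mean in computed_mean.items():
    diff = abs(mean - reported_mean[model])
    print(f"{model:>5}: computed {mean:.4f}, reported {reported_mean[model]}, |diff| {diff:.5f}")
```

If the assumption holds, each computed mean should match the reported column up to rounding in the fourth decimal place.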

Key Findings