Sharpness Evolution at LLM Scale

Scaling Law Slope (sharpness per decade of parameters): -0.1055
Scaling Law R-squared: 0.9983
Sharpness-Loss Correlation: 0.9945
Sharpness-Performance Correlation: -0.9992
Scale-Sharpness Correlation: -0.9991

Model	Peak Sharpness	Final Sharpness	Final Loss	Mean Accuracy	Sharpness-Loss r	Sharpness-Gradient r
10M	2.0644	1.2785	5.009	0.3616	0.4445	0.9218
125M	2.4108	1.1669	4.1508	0.4532	0.4794	0.9638
350M	2.5996	1.1217	3.8431	0.4843	0.4884	0.9705
1.3B	2.7585	1.0646	3.4863	0.5344	0.5181	0.9795
3B	2.8945	1.0135	3.2753	0.5674	0.5227	0.982
7B	2.9976	0.9804	3.077	0.603	0.5335	0.9849

Model	HellaSwag	ARC-Easy	PIQA	WinoGrande	LAMBADA
10M	0.3658	0.387	0.4318	0.3266	0.2966
125M	0.4474	0.477	0.5057	0.441	0.3951
350M	0.4743	0.5041	0.5521	0.4598	0.4312
1.3B	0.548	0.5612	0.5921	0.5121	0.4586
3B	0.558	0.6098	0.631	0.5317	0.5064
7B	0.6144	0.6286	0.6508	0.578	0.5432
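The Mean Accuracy column in the first table appears to be the unweighted average of the five benchmark scores in the second table. The following is a minimal sketch that checks this assumption; the variable names (`scores`, `reported_mean`, `computed_mean`) are illustrative and not part of the original report.

```python
# Benchmark accuracies copied from the table above, in column order:
# HellaSwag, ARC-Easy, PIQA, WinoGrande, LAMBADA
scores = {
    "10M":  [0.3658, 0.3870, 0.4318, 0.3266, 0.2966],
    "125M": [0.4474, 0.4770, 0.5057, 0.4410, 0.3951],
    "350M": [0.4743, 0.5041, 0.5521, 0.4598, 0.4312],
    "1.3B": [0.5480, 0.5612, 0.5921, 0.5121, 0.4586],
    "3B":   [0.5580, 0.6098, 0.6310, 0.5317, 0.5064],
    "7B":   [0.6144, 0.6286, 0.6508, 0.5780, 0.5432],
}

# Mean Accuracy values as reported in the first table.
reported_mean = {
    "10M": 0.3616, "125M": 0.4532, "350M": 0.4843,
    "1.3B": 0.5344, "3B": 0.5674, "7B": 0.6030,
}

# Assumption: Mean Accuracy is the unweighted average over the five benchmarks.
computed_mean = {model: sum(vals) / len(vals) for model, vals in scores.items()}

for model, mean in computed_mean.items():
    diff = abs(mean - reported_mean[model])
    print(f"{model:>5}: computed={mean:.4f} reported={reported_mean[model]:.4f} diff={diff:.5f}")
```

Under this assumption, every computed mean agrees with the reported value to within rounding at four decimal places.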

Final sharpness follows a log-linear scaling law in parameter count N: S = -0.1055 * log10(N) + 2.0196 (R-squared = 0.9983)
Larger models converge to flatter minima: final sharpness decreases from 1.2785 (10M) to 0.9804 (7B)
Strong negative correlation, across the six model sizes, between final sharpness and mean downstream accuracy (r = -0.9992)
Strong positive correlation, across the six model sizes, between final sharpness and final training loss (r = 0.9945)
Sharpness-gradient coupling strengthens with scale: r increases from 0.9218 (10M) to 0.9849 (7B)
All models exhibit a three-phase sharpness pattern: initial rise, exponential decay, and plateau stabilization
Peak sharpness increases with scale: from 2.0644 (10M) to 2.9976 (7B)
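The scaling-law fit and the cross-model correlations listed above can be recomputed from the table values. The sketch below assumes that N is the nominal parameter count implied by each model label (for example, 1.3B = 1.3e9) and that the fit is an ordinary least-squares line in log10(N). The helper `predict_sharpness` is hypothetical and only illustrates how the fitted line would be evaluated.

```python
import numpy as np

# Values copied from the first table. Parameter counts are assumed from the labels.
n_params        = np.array([10e6, 125e6, 350e6, 1.3e9, 3e9, 7e9])
final_sharpness = np.array([1.2785, 1.1669, 1.1217, 1.0646, 1.0135, 0.9804])
final_loss      = np.array([5.009, 4.1508, 3.8431, 3.4863, 3.2753, 3.077])
mean_accuracy   = np.array([0.3616, 0.4532, 0.4843, 0.5344, 0.5674, 0.603])

# Ordinary least-squares fit of S = slope * log10(N) + intercept.
log_n = np.log10(n_params)
slope, intercept = np.polyfit(log_n, final_sharpness, 1)

# R-squared of the linear fit.
residuals = final_sharpness - (slope * log_n + intercept)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((final_sharpness - final_sharpness.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Pearson correlations across the six model sizes.
r_scale = np.corrcoef(log_n, final_sharpness)[0, 1]
r_loss  = np.corrcoef(final_sharpness, final_loss)[0, 1]
r_acc   = np.corrcoef(final_sharpness, mean_accuracy)[0, 1]

def predict_sharpness(n):
    """Hypothetical helper: evaluate the fitted line at parameter count n."""
    return slope * np.log10(n) + intercept

print(f"slope={slope:.4f} intercept={intercept:.4f} R^2={r_squared:.4f}")
print(f"r(scale, sharpness)={r_scale:.4f}")
print(f"r(sharpness, loss)={r_loss:.4f}")
print(f"r(sharpness, accuracy)={r_acc:.4f}")
```

With only six points, the fit describes the measured range (10M to 7B parameters). Evaluating `predict_sharpness` outside that range is an extrapolation that the data here do not validate.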

Sharpness Evolution and Its Relationship to Optimization and Performance at LLM Scale