This simulation study examines how loss-landscape sharpness evolves during training across model scales (10M to 7B parameters), and how it relates to optimization behavior and downstream task performance.

| Model | Peak Sharpness | Final Sharpness | Final Loss | Mean Accuracy | Sharpness-Loss r | Sharpness-Grad r |
|---|---|---|---|---|---|---|
| 10M | 2.0644 | 1.2785 | 5.009 | 0.3616 | 0.4445 | 0.9218 |
| 125M | 2.4108 | 1.1669 | 4.1508 | 0.4532 | 0.4794 | 0.9638 |
| 350M | 2.5996 | 1.1217 | 3.8431 | 0.4843 | 0.4884 | 0.9705 |
| 1.3B | 2.7585 | 1.0646 | 3.4863 | 0.5344 | 0.5181 | 0.9795 |
| 3B | 2.8945 | 1.0135 | 3.2753 | 0.5674 | 0.5227 | 0.982 |
| 7B | 2.9976 | 0.9804 | 3.077 | 0.603 | 0.5335 | 0.9849 |
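
As a rough cross-scale check on the table above, the sketch below computes the Pearson correlation between final sharpness and mean accuracy over the six model sizes. The values are copied from the table; the helper name `pearson_r` is illustrative only. With six points, the result describes a trend across scales rather than a statistically robust estimate, and it is a different quantity from the per-model r columns, whose exact computation is not specified here.

```python
import math

# Values copied from the summary table, ordered 10M, 125M, 350M, 1.3B, 3B, 7B.
final_sharpness = [1.2785, 1.1669, 1.1217, 1.0646, 1.0135, 0.9804]
mean_accuracy = [0.3616, 0.4532, 0.4843, 0.5344, 0.5674, 0.6030]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(final_sharpness, mean_accuracy)
print(f"final sharpness vs. mean accuracy across scales: r = {r:.4f}")
```

The coefficient comes out strongly negative, reflecting that larger models finish training at lower sharpness and higher accuracy. On its own it does not separate the effect of sharpness from the effect of scale.
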

| Model | HellaSwag | ARC-Easy | PIQA | WinoGrande | LAMBADA |
|---|---|---|---|---|---|
| 10M | 0.3658 | 0.387 | 0.4318 | 0.3266 | 0.2966 |
| 125M | 0.4474 | 0.477 | 0.5057 | 0.441 | 0.3951 |
| 350M | 0.4743 | 0.5041 | 0.5521 | 0.4598 | 0.4312 |
| 1.3B | 0.548 | 0.5612 | 0.5921 | 0.5121 | 0.4586 |
| 3B | 0.558 | 0.6098 | 0.631 | 0.5317 | 0.5064 |
| 7B | 0.6144 | 0.6286 | 0.6508 | 0.578 | 0.5432 |
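
The Mean Accuracy column in the first table matches, to four decimals, the unweighted mean of the five benchmark scores in this table. The short check below recomputes it from the values as printed; the variable and function names are choices made for this sketch, not part of the study's code.

```python
# Per-benchmark accuracies copied from the table above, in column order:
# HellaSwag, ARC-Easy, PIQA, WinoGrande, LAMBADA.
per_task = {
    "10M": [0.3658, 0.3870, 0.4318, 0.3266, 0.2966],
    "125M": [0.4474, 0.4770, 0.5057, 0.4410, 0.3951],
    "350M": [0.4743, 0.5041, 0.5521, 0.4598, 0.4312],
    "1.3B": [0.5480, 0.5612, 0.5921, 0.5121, 0.4586],
    "3B": [0.5580, 0.6098, 0.6310, 0.5317, 0.5064],
    "7B": [0.6144, 0.6286, 0.6508, 0.5780, 0.5432],
}

# Mean Accuracy as reported in the summary table.
reported_mean = {
    "10M": 0.3616,
    "125M": 0.4532,
    "350M": 0.4843,
    "1.3B": 0.5344,
    "3B": 0.5674,
    "7B": 0.6030,
}


def recomputed_means(scores):
    """Unweighted mean per model, rounded to the table's four decimals."""
    return {model: round(sum(vals) / len(vals), 4) for model, vals in scores.items()}


means = recomputed_means(per_task)
for model, value in means.items():
    diff = abs(value - reported_mean[model])
    print(f"{model:>5}: recomputed {value:.4f}, reported {reported_mean[model]:.4f}, diff {diff:.4f}")
```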