Reliable Hyperparameter Transfer Across Scales
Transferring optimal LR, initialization, and weight decay from small proxies to large LLMs
Adaptive TE at 7B: 0.149
Training Stability: 100%
Error Reduction vs muP: 56%
Scales Tested: 5
[Charts: Transfer Error Across Scales; Training Stability; Loss Ratio (Transferred / Optimal); Optimal LR Scaling Law]
Key Findings
Standard parametrization fails catastrophically at scale (TE=9.56, 0% stability at 7B).
muP achieves good transfer (TE=0.34 at 7B) but underestimates depth effects.
Adaptive Transfer achieves the best transfer (TE=0.15 at 7B, a 56% error reduction vs muP) with 100% stability.
Optimal LR follows a power law: LR ∝ width^(-0.85) (see the sketch after this list).
Depth-dependent corrections are critical for very large models.
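To make the scaling rule concrete, here is a minimal Python sketch of applying the fitted exponent to carry a learning rate tuned on a small proxy over to a wider target model. The function name, the proxy LR, and the widths are illustrative assumptions, not values from the experiments.

```python
def transfer_lr(proxy_lr: float, proxy_width: int, target_width: int,
                exponent: float = -0.85) -> float:
    """Rescale a learning rate tuned at a proxy width to a target width,
    assuming the fitted power law LR proportional to width^(-0.85)."""
    return proxy_lr * (target_width / proxy_width) ** exponent

# Illustrative usage: LR tuned on a 1024-wide proxy, transferred to an 8192-wide model.
proxy_lr = 3e-3  # hypothetical value found by sweeping on the proxy
target_lr = transfer_lr(proxy_lr, proxy_width=1024, target_width=8192)
print(f"transferred LR: {target_lr:.2e}")
```

In this sketch the exponent is the only quantity taken from the findings above; depth-dependent corrections, which the last finding flags as critical at very large scale, are not modeled here.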