Reliable Hyperparameter Transfer Across Scales

Transferring the optimal learning rate (LR), initialization, and weight decay from small proxy models to large language models

Adaptive TE at 7B: 0.149
Training Stability: 100%
Error Reduction vs muP: 56%
Scales Tested: 5

[Figure panels: Transfer Error Across Scales; Training Stability; Loss Ratio (Transferred / Optimal); Optimal LR Scaling Law]

Key Findings