Reliable Hyperparameter Transfer Across Scales

Adaptive TE at 7B: 0.149

Training Stability: 100%

Error Reduction vs muP: 56%

Scales Tested

Standard parametrization fails catastrophically at scale (TE=9.56, 0% stability at 7B).
muP achieves good transfer (TE=0.34 at 7B) but underestimates depth effects.
Adaptive Transfer achieves the best transfer (TE=0.15 at 7B) with 100% stability.
The optimal learning rate (LR) follows a power law in model width: LR proportional to width^(-0.85).
Depth-dependent corrections are critical for very large models.
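
The width power law and the reported error reduction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the base width of 256, and the base learning rate of 3e-3 are hypothetical, and only the exponent -0.85 and the TE values 0.34 and 0.149 come from the findings above. The depth-dependent corrections are not modeled, because their form is not given here.

```python
def transfer_lr(base_lr: float, base_width: int, target_width: int,
                exponent: float = -0.85) -> float:
    """Scale a learning rate tuned at base_width to target_width.

    Applies LR proportional to width**exponent, so the ratio of the two
    learning rates equals (target_width / base_width) ** exponent.
    The default exponent is the value reported in the findings.
    """
    if base_lr <= 0 or base_width <= 0 or target_width <= 0:
        raise ValueError("base_lr and both widths must be positive")
    return base_lr * (target_width / base_width) ** exponent


def relative_error_reduction(baseline_te: float, new_te: float) -> float:
    """Fraction by which new_te improves on baseline_te."""
    if baseline_te <= 0:
        raise ValueError("baseline_te must be positive")
    return (baseline_te - new_te) / baseline_te


if __name__ == "__main__":
    # Hypothetical proxy model: LR tuned at width 256, transferred to width 4096.
    lr_large = transfer_lr(base_lr=3e-3, base_width=256, target_width=4096)
    print(f"transferred LR: {lr_large:.3e}")

    # TE values at 7B from the findings: muP 0.34, Adaptive Transfer 0.149.
    reduction = relative_error_reduction(baseline_te=0.34, new_te=0.149)
    print(f"error reduction vs muP: {reduction:.0%}")
```

Under this rule, each doubling of width multiplies the learning rate by 2^(-0.85), roughly 0.55, which is a gentler decay than an inverse-width rule would give.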