Ultimate Extrapolation Boundaries: Fitting vs Transfer

Identifying the maximum scale at which learning-rate predictions remain accurate for large-scale pre-training.

Tags: cs.AI, Scaling Laws, muTransfer
Fitting Boundary: 32x
Transfer Boundary: 128x
Fitting Max Scale (from 250M): 8B
Error Threshold: 5%

Fitting vs Transfer: Error Comparison

[Figure: log-scale prediction error as a function of extrapolation ratio.]

Predicted vs True Validation Loss

[Figure: fitting predictions diverge from true loss beyond the boundary.]

Summary Statistics

Metric          Fitting           Transfer
Boundary Ratio  32x               128x
Boundary Scale  8B params         32B params
Error at 2x     0.3%              0.1%
Error at 32x    4.8%              1.2%
Error at 128x   16.1%             3.8%
Failure Mode    Phase transition  Gradual drift
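
The boundary ratios can be cross-checked against the reported errors. The following is a minimal sketch, not the authors' analysis code: the error values are copied from the Summary Statistics table, the 5% threshold is the one stated above, and the function and variable names are hypothetical. It only considers the three ratios that were reported, so it says nothing about behavior between them.

```python
# Errors (in percent) copied from the Summary Statistics table.
# Keys are extrapolation ratios (target scale / source scale).
REPORTED_ERROR_PCT = {
    "fitting": {2: 0.3, 32: 4.8, 128: 16.1},
    "transfer": {2: 0.1, 32: 1.2, 128: 3.8},
}

# Error threshold stated on the page.
ERROR_THRESHOLD_PCT = 5.0


def largest_ratio_within_threshold(errors_by_ratio, threshold_pct=ERROR_THRESHOLD_PCT):
    """Return the largest reported ratio whose error is below the threshold.

    Returns None if no reported ratio is below the threshold.
    """
    passing = [ratio for ratio, err in errors_by_ratio.items() if err < threshold_pct]
    return max(passing) if passing else None


if __name__ == "__main__":
    for method, errors in REPORTED_ERROR_PCT.items():
        print(method, largest_ratio_within_threshold(errors))
```

With the table's values, the largest reported ratio under the threshold is 32 for fitting and 128 for transfer, which matches the Boundary Ratio row.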

Key Findings

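  • Fitting boundary at 32x: Power-law fitting keeps prediction error below the 5% threshold up to 32x extrapolation (4.8% at 32x); by 128x the error reaches 16.1%.
  • Transfer is more resilient: muTransfer degrades smoothly and stays below the 5% threshold up to 128x extrapolation (3.8% at 128x).
  • Sharp vs smooth: Fitting fails through a phase transition past its boundary; Transfer drifts gradually.
  • Practical rule: For fitting, source experiments should be at least 1/32 of the target scale, for example a 250M-parameter source for an 8B-parameter target.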
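
The practical rule reduces to simple arithmetic on the boundary ratios. The sketch below is illustrative only: it assumes scale is measured in parameter count (as in the Boundary Scale row), and the helper names are hypothetical rather than taken from any library.

```python
# Boundary ratios reported on this page.
FITTING_BOUNDARY = 32
TRANSFER_BOUNDARY = 128


def min_source_params(target_params, boundary_ratio):
    """Smallest source scale that keeps the target within the boundary ratio."""
    return target_params / boundary_ratio


def max_target_params(source_params, boundary_ratio):
    """Largest target scale reachable from a source within the boundary ratio."""
    return source_params * boundary_ratio


if __name__ == "__main__":
    # An 8B-parameter target needs a source of at least 250M parameters for fitting.
    print(min_source_params(8_000_000_000, FITTING_BOUNDARY))
    # A 250M-parameter source reaches 32B parameters under the transfer boundary.
    print(max_target_params(250_000_000, TRANSFER_BOUNDARY))
```

Both results agree with the Boundary Scale row: 250M x 32 = 8B for fitting and 250M x 128 = 32B for transfer.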