L(W, gamma) = ||gamma * (W x) - y||^2;  SNR_W ~ sqrt(B)/d^2;  SNR_gamma ~ sqrt(B)/d  (B = batch size, d = dimension)
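One way to measure the two SNRs numerically is sketched below. It is a minimal illustration, not the experiment behind these results. The linear teacher W_star, the Gaussian inputs, the noise level, the function name gradient_snr, and the SNR definition (norm of the mean gradient divided by the square root of the per-sample variance trace over B) are all assumptions introduced here. The powers of d that come out depend on normalization and on where in training the gradients are taken, so the sketch only shows the estimator and its sqrt(B) dependence.

```python
import numpy as np

def gradient_snr(W, gamma, W_star, batch_size,
                 n_samples=4000, noise_std=0.1, seed=0):
    """Estimate minibatch gradient SNR for W and gamma (hypothetical helper).

    Assumed definition: SNR = ||E[g]|| / sqrt(trace(Cov[g]) / B),
    where g is the per-sample gradient of ||gamma * W x - y||^2.
    """
    rng = np.random.default_rng(seed)
    d_out, d_in = W_star.shape
    # Assumed data model: x ~ N(0, I), y = W_star x + Gaussian noise.
    x = rng.standard_normal((n_samples, d_in))
    y = x @ W_star.T + noise_std * rng.standard_normal((n_samples, d_out))

    h = x @ W.T            # rows are W x
    r = gamma * h - y      # per-sample residuals

    # Per-sample gradients of the squared error.
    g_W = 2.0 * gamma * r[:, :, None] * x[:, None, :]   # shape (n, d_out, d_in)
    g_gamma = 2.0 * np.sum(r * h, axis=1)               # shape (n,)

    def snr(g):
        g = g.reshape(n_samples, -1)
        signal = np.linalg.norm(g.mean(axis=0))
        noise = np.sqrt(g.var(axis=0).sum() / batch_size)
        return signal / noise

    return snr(g_W), snr(g_gamma)

# Usage: same random initial point, two batch sizes.
d = 16
init_rng = np.random.default_rng(1)
W0 = init_rng.standard_normal((d, d)) / np.sqrt(d)
W_star = init_rng.standard_normal((d, d)) / np.sqrt(d)
snr_w_16, snr_g_16 = gradient_snr(W0, 1.0, W_star, batch_size=16)
snr_w_64, snr_g_64 = gradient_snr(W0, 1.0, W_star, batch_size=64)
print(snr_w_16, snr_g_16, snr_w_64, snr_g_64)
```

With this definition, quadrupling B doubles both estimates by construction, so their ratio does not depend on B.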
Key Findings
- Matrix W settles into an equilibrium between gradient noise and weight decay (WD); scalar gamma tracks the signal freely
- The SNR gap grows with dimension d because gradient noise scales as d^2 for W but only as d for gamma, so SNR_gamma / SNR_W ~ d
- Larger batch sizes raise both SNRs (each scales as sqrt(B)), which reduces the regime separation
- Weight decay is necessary for the two-regime behavior; without it both parameters grow freely
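The role of weight decay can be explored with a small SGD loop on the same loss. This is a sketch under stated assumptions, not the setup that produced the findings above. The teacher W_star, the Gaussian data, the hyperparameters, the function name train, and the choice to apply decay to W only are all hypothetical. It is not expected to reproduce the reported curves. It only lets the final norm of W and the final gamma be compared with and without decay.

```python
import numpy as np

def train(weight_decay, d=8, batch_size=32, lr=0.01, steps=2000,
          noise_std=0.1, seed=0):
    """Run plain SGD on ||gamma * W x - y||^2 (hypothetical toy setup).

    Returns the final Frobenius norm of W and the final gamma.
    """
    rng = np.random.default_rng(seed)
    W_star = rng.standard_normal((d, d)) / np.sqrt(d)  # assumed teacher
    W = rng.standard_normal((d, d)) / np.sqrt(d)       # assumed init
    gamma = 1.0
    for _ in range(steps):
        x = rng.standard_normal((batch_size, d))
        y = x @ W_star.T + noise_std * rng.standard_normal((batch_size, d))
        h = x @ W.T
        r = gamma * h - y
        # Minibatch-averaged gradients of the squared error.
        grad_W = 2.0 * gamma * (r.T @ x) / batch_size
        grad_gamma = 2.0 * np.sum(r * h) / batch_size
        # Assumption: L2 weight decay acts on W only, not on gamma.
        W = W - lr * (grad_W + weight_decay * W)
        gamma = gamma - lr * grad_gamma
    return np.linalg.norm(W), gamma

# Same seed for both runs, so data, teacher, and init are identical.
norm_wd, gamma_wd = train(weight_decay=0.1)
norm_no_wd, gamma_no_wd = train(weight_decay=0.0)
print(norm_wd, gamma_wd, norm_no_wd, gamma_no_wd)
```

Because both runs share a seed, any difference in the final norm of W comes from the decay term alone.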
[Figure panels: Parameter Norms During Training; Signal-to-Noise Ratios; Batch Size Effect on SNR; Dimension Scaling]