Generalizability of LR Scaling Laws: MoE to Dense Transformers

Do learning-rate scaling laws fitted on MoE architectures transfer to dense Transformers?

[Figure: Loss by Paradigm and Architecture; LR Error by Paradigm]

Scaling Law Parameters

Architecture | c | α | β | Fitting Loss | μTransfer Loss | Grid Search Loss
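
The fitted parameters c, α, and β suggest a power-law rule for the optimal learning rate, e.g. η_opt = c · N^α · D^β with model size N and training-token count D. As an illustrative sketch only (the exact functional form, variable names, and data points below are assumptions, not values from this work), such a law can be fitted by ordinary least squares in log-space:

```python
import numpy as np

# Hypothetical sketch: fit a power-law LR scaling rule of the assumed form
#   eta_opt = c * N**alpha * D**beta
# where N = parameter count and D = training tokens. The data points below
# are placeholders, not measurements from this work.
N   = np.array([1e8, 3e8, 1e9, 3e9])            # model sizes
D   = np.array([2e9, 6e9, 2e10, 6e10])          # training-token counts
eta = np.array([6e-4, 4e-4, 2.5e-4, 1.5e-4])    # grid-searched optimal LRs

# In log-space the power law becomes linear:
#   log(eta) = log(c) + alpha*log(N) + beta*log(D)
X = np.column_stack([np.ones_like(N), np.log(N), np.log(D)])
log_c, alpha, beta = np.linalg.lstsq(X, np.log(eta), rcond=None)[0]
c = np.exp(log_c)
print(f"c={c:.3e}  alpha={alpha:.3f}  beta={beta:.3f}")

def predict_lr(n_params: float, n_tokens: float) -> float:
    """Predict the optimal learning rate for an unseen (N, D) configuration."""
    return c * n_params**alpha * n_tokens**beta

print(f"predicted LR for 7B params / 140B tokens: {predict_lr(7e9, 1.4e11):.2e}")
```

In a setup like the one summarized in the table, a law fitted this way on MoE runs would presumably be applied to dense Transformer runs, and the resulting loss compared against μTransfer and a full learning-rate grid search.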