Generalizability of LR Scaling Laws: MoE to Dense Transformers
Do learning-rate configuration findings obtained on MoE architectures transfer to dense Transformers?
[Figure: three panels. (a) Loss by Paradigm and Architecture, comparing Fitting Loss, muTransfer Loss, and Grid Search Loss; (b) LR Error by Paradigm; (c) Scaling Law Parameters, a table of the fitted constants c, alpha, and beta per Architecture.]
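The third panel tabulates fitted scaling-law parameters c, alpha, and beta per architecture, but the functional form is not reproduced here. The sketch below illustrates one common parameterization, eta_opt(N, D) = c * N^alpha * D^beta, fitted by ordinary least squares in log space; the data values, the `predict_lr` helper, and the fitting routine are illustrative assumptions for exposition, not the paper's method or results.

```python
import numpy as np

# Hypothetical grid-search results: model size N (parameters), data budget D
# (tokens), and the empirically optimal learning rate for each (N, D) pair.
# These numbers are placeholders, not values from the figure.
N = np.array([1e8, 3e8, 1e9, 3e9])
D = np.array([2e9, 6e9, 2e10, 6e10])
lr_opt = np.array([3.2e-3, 1.8e-3, 9.5e-4, 5.1e-4])

# Assume the power-law form lr_opt = c * N**alpha * D**beta.
# Taking logs reduces the fit to linear least squares:
#   log lr_opt = log c + alpha * log N + beta * log D
X = np.column_stack([np.ones_like(N), np.log(N), np.log(D)])
coef, *_ = np.linalg.lstsq(X, np.log(lr_opt), rcond=None)
c, alpha, beta = np.exp(coef[0]), coef[1], coef[2]

print(f"c = {c:.3g}, alpha = {alpha:.3f}, beta = {beta:.3f}")

def predict_lr(n_params: float, n_tokens: float) -> float:
    """Extrapolate the fitted law to a new (N, D) point."""
    return c * n_params**alpha * n_tokens**beta

# Example: predict the optimal LR for a larger dense model.
print(f"Predicted LR at N=7e9, D=1.4e11: {predict_lr(7e9, 1.4e11):.3g}")
```

Under this parameterization, checking transfer from MoE to dense Transformers amounts to fitting (c, alpha, beta) on each architecture family and comparing both the constants and the loss achieved when one family's fitted law selects the learning rate for the other.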