Persistence of the Weight-Activation Gap in MoE Models

This page investigates whether the disconnect between weight-space and activation-space orthogonality persists across model scales and architectural variants in Mixture-of-Experts (MoE) models.
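This section does not define MSO, so the sketch below rests on an assumption: that MSO is a mean squared overlap, taken here as the mean squared cosine similarity over all distinct pairs of experts. Under that assumption the same function can be applied to flattened expert weights (weight space) and to expert outputs on a shared batch of inputs (activation space). The toy linear experts, the random inputs, and all function names are hypothetical and are not taken from the experiments reported here.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_squared_overlap(vectors):
    """ASSUMED definition of MSO: mean squared cosine over distinct pairs."""
    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(vectors[i], vectors[j]) ** 2 for i, j in pairs) / len(pairs)


def expert_outputs(weight, inputs):
    """Flattened outputs of a toy linear expert (a list of rows) on each input."""
    return [sum(w * x for w, x in zip(row, inp)) for inp in inputs for row in weight]


# Hypothetical toy setup: 4 random linear experts of size 8x8, 16 random inputs.
random.seed(0)
dim, n_experts, n_inputs = 8, 4, 16
experts = [
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    for _ in range(n_experts)
]
inputs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_inputs)]

# Weight space: one flattened weight vector per expert.
weight_vectors = [[w for row in expert for w in row] for expert in experts]
# Activation space: one flattened output vector per expert on the shared inputs.
activation_vectors = [expert_outputs(expert, inputs) for expert in experts]

weight_mso = mean_squared_overlap(weight_vectors)
activation_mso = mean_squared_overlap(activation_vectors)
print(f"weight MSO = {weight_mso:.4f}, activation MSO = {activation_mso:.4f}")
```

With this definition, mutually orthogonal vectors give an overlap of 0 and identical vectors give 1, so lower values mean experts are closer to orthogonal in the chosen space.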

Pearson r (Weight vs Act MSO): -0.112, p = 0.596 (not significant)
Act MSO / Weight MSO Ratio: 50-90x (across all scales)
Architectures with Gap: 5/5 (universal presence)
Max Expert Params Tested: 4.2M (gap persists at all scales)

[Charts: Regularization Scan: Weight vs Activation MSO; Scale Dependence of the Gap; Architecture Comparison; Gap Magnitude Across Architectures]

Scale Dependence Results

Model Dim  Expert Params  Weight MSO  Activation MSO  Gap     Ratio (Act/Weight)
32         65K            2.40e-4     2.24e-2         0.0222  93x
64         262K           6.27e-5     1.57e-2         0.0156  250x
128        1.0M           1.46e-5     7.89e-3         0.0079  540x
256        4.2M           4.29e-6     3.65e-3         0.0036  851x
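The Gap and Ratio columns can be cross-checked against the two MSO columns. The sketch below assumes that Gap is Activation MSO minus Weight MSO and that Ratio is Activation MSO divided by Weight MSO; the column header states the ratio, while the subtraction is an inference from the numbers. The helper name `gap_and_ratio` is hypothetical.

```python
# (model_dim, weight_mso, activation_mso), copied from the scale dependence table.
rows = [
    (32, 2.40e-4, 2.24e-2),
    (64, 6.27e-5, 1.57e-2),
    (128, 1.46e-5, 7.89e-3),
    (256, 4.29e-6, 3.65e-3),
]


def gap_and_ratio(weight_mso, activation_mso):
    """Return (gap, ratio), ASSUMING gap = act - weight and ratio = act / weight."""
    return activation_mso - weight_mso, activation_mso / weight_mso


# One (model_dim, gap, ratio) tuple per table row.
results = [(dim, *gap_and_ratio(w, a)) for dim, w, a in rows]

for dim, gap, ratio in results:
    print(f"dim={dim:>3}  gap={gap:.4f}  ratio={ratio:.0f}x")
```

Rounded to the precision shown in the table, these values reproduce the Gap and Ratio columns. The absolute gap shrinks as the model dimension grows, while the ratio increases.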