Investigating whether RL-induced task-vector sparsity and the advantages of Reinforced Agent Merging (RAM) persist at massive scale (70B+ parameters), via scaling-law analysis, architectural decomposition, and merging benchmarks.
Reinforced Agent Merging (RAM) exploits the observation that RL fine-tuning produces sparse, heterogeneous task vectors (the element-wise difference between fine-tuned and base weights). All prior RAM experiments used 3B-7B models, which leaves two open questions: does the sparsity persist at 70B+ parameters, and does RAM remain effective there?
RL task-vector sparsity statistics as a function of model size:
| Model Size (B) | L0 Sparsity | Gini | Kurtosis | Top-1% Mass |
|---|---|---|---|---|
| 1 | 0.914 | 0.927 | 793 | 0.400 |
| 3 | 0.961 | 0.958 | 1877 | 0.514 |
| 7 | 0.963 | 0.968 | 1301 | 0.566 |
| 14 | 0.978 | 0.979 | 2340 | 0.655 |
| 32 | 0.989 | 0.985 | 6966 | 0.740 |
| 70 | 0.994 | 0.991 | 13021 | 0.854 |
| 200 | 0.995 | 0.993 | 16121 | 0.915 |
| 405 | 0.997 | 0.994 | 39019 | 0.927 |
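A minimal sketch of how statistics like these could be computed for a single task vector, assuming the base and RL-fine-tuned checkpoints fit in host memory as PyTorch state dicts; the near-zero threshold `eps`, the function names, and the exact metric conventions (e.g., plain rather than excess kurtosis) are assumptions, not the reported setup.

```python
import torch

def task_vector(base_sd: dict, rl_sd: dict) -> torch.Tensor:
    """Flatten the element-wise difference between RL-tuned and base weights."""
    deltas = [(rl_sd[k] - base_sd[k]).flatten().float() for k in base_sd]
    return torch.cat(deltas)

def sparsity_stats(tau: torch.Tensor, eps: float = 1e-6) -> dict:
    """L0 sparsity, Gini coefficient, kurtosis, and top-1% mass of |tau|."""
    mag = tau.abs()
    l0_sparsity = (mag <= eps).float().mean().item()   # fraction of ~zero entries

    sorted_mag, _ = torch.sort(mag)                     # Gini over sorted magnitudes
    n = sorted_mag.numel()
    idx = torch.arange(1, n + 1, dtype=torch.float64)
    gini = ((2 * idx - n - 1) * sorted_mag.double()).sum() / (n * sorted_mag.double().sum())

    centered = tau - tau.mean()                         # heavy-tailedness of the delta distribution
    kurtosis = (centered ** 4).mean() / (centered ** 2).mean() ** 2

    k = max(1, int(0.01 * n))                           # mass held by the largest 1% of |tau|
    top1_mass = torch.topk(mag, k).values.sum() / mag.sum()

    return {"l0_sparsity": l0_sparsity,
            "gini": gini.item(),
            "kurtosis": kurtosis.item(),
            "top1pct_mass": top1_mass.item()}
```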
At high sparsity (>=90%), RAM matches Simple Averaging (both reach a cosine similarity of 1.0), while TIES and DARE degrade significantly. In other words, RAM's distribution-aware disentanglement fully preserves every agent's contribution once sparsity is high enough that the agents' unique, non-overlapping parameter regions dominate the merge.
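A toy illustration (not the RAM implementation) of why high sparsity collapses the differences between merging rules: when task vectors have nearly disjoint supports, even a plain element-wise average preserves each agent's contribution on its own support up to a scale factor.

```python
import torch

torch.manual_seed(0)
d, nnz = 10_000, 100                       # dimension, nonzeros per agent (99% sparse)

def sparse_task_vector(dim: int, k: int) -> torch.Tensor:
    """Random task vector with only k nonzero coordinates."""
    tau = torch.zeros(dim)
    tau[torch.randperm(dim)[:k]] = torch.randn(k)
    return tau

agents = [sparse_task_vector(d, nnz) for _ in range(3)]

# At ~99% sparsity the expected support overlap between agents is only a few
# coordinates, so the averaged merge is essentially each agent's vector (scaled)
# on that agent's own support.
merged = torch.stack(agents).mean(dim=0)

for i, tau in enumerate(agents):
    support = tau != 0
    cos = torch.nn.functional.cosine_similarity(merged[support], tau[support], dim=0)
    print(f"agent {i}: cosine(merge, agent) on its own support = {cos.item():.3f}")
```

With near-disjoint supports the printed similarities come out at roughly 0.99-1.0; pruning- or dropout-based schemes, by contrast, can discard exactly the unique coordinates that carry each agent's signal.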
Architectural decomposition of task-vector L0 sparsity by module type:
| Module Type | Mean L0 Sparsity | Std L0 Sparsity | Tensor Count |
|---|---|---|---|
| Attention | 0.969 | 0.011 | 320 |
| MLP | 0.945 | 0.015 | 240 |
| LayerNorm | 0.880 | 0.035 | 161 |
| Embedding | 0.974 | 0.000 | 1 |
| LM Head | 0.918 | 0.000 | 1 |
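One way this breakdown could be produced is by bucketing parameter names by module type; the name patterns below assume a Llama-style naming scheme, and the precomputed per-tensor L0 sparsity map is a hypothetical input.

```python
from collections import defaultdict
import statistics

def module_type(name: str) -> str:
    """Bucket a parameter name into a coarse module type (Llama-style names assumed)."""
    if any(p in name for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return "Attention"
    if any(p in name for p in ("gate_proj", "up_proj", "down_proj")):
        return "MLP"
    if "norm" in name:
        return "LayerNorm"
    if "embed_tokens" in name:
        return "Embedding"
    if "lm_head" in name:
        return "LM Head"
    return "Other"

def decompose(per_tensor_l0: dict[str, float]) -> dict[str, tuple[float, float, int]]:
    """Mean/std/count of L0 sparsity per module type, given per-tensor sparsities."""
    buckets = defaultdict(list)
    for name, l0 in per_tensor_l0.items():
        buckets[module_type(name)].append(l0)
    return {mtype: (statistics.mean(vals),
                    statistics.pstdev(vals) if len(vals) > 1 else 0.0,
                    len(vals))
            for mtype, vals in buckets.items()}
```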
Peak memory footprint of naive in-memory merging versus streaming (tensor-by-tensor) merging:
| Model | FP16 Weights (GB) | Naive Merge Peak (GB) | Streaming Merge Peak (GB) | Reduction vs Naive |
|---|---|---|---|---|
| 3B | 6.0 | 24.0 | 2.6 | 9x |
| 7B | 14.0 | 56.0 | 3.0 | 19x |
| 13B | 26.0 | 104.0 | 3.6 | 29x |
| 70B | 140.0 | 560.0 | 6.0 | 93x |
| 405B | 810.0 | 3240.0 | 18.0 | 180x |
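The streaming column assumes tensors are merged shard-by-shard and tensor-by-tensor rather than materializing every model in memory at once. A minimal sketch with `safetensors`, where plain averaging stands in for RAM's actual combination rule and identical shard/tensor layouts across agents are assumed:

```python
import os
from contextlib import ExitStack

import torch
from safetensors import safe_open
from safetensors.torch import save_file

def streaming_merge(agent_dirs: list[str], out_dir: str) -> None:
    """Merge sharded .safetensors checkpoints shard-by-shard: peak memory is roughly
    one shard's worth of merged tensors plus one tensor per agent, never the full
    set of models."""
    os.makedirs(out_dir, exist_ok=True)
    shards = sorted(f for f in os.listdir(agent_dirs[0]) if f.endswith(".safetensors"))
    for shard in shards:
        with ExitStack() as stack:
            # safe_open memory-maps each shard; tensors are materialized one at a time.
            handles = [stack.enter_context(
                           safe_open(os.path.join(d, shard), framework="pt", device="cpu"))
                       for d in agent_dirs]
            merged = {}
            for name in handles[0].keys():
                stacked = torch.stack([h.get_tensor(name) for h in handles])
                merged[name] = stacked.mean(dim=0)   # plain averaging; RAM's rule would slot in here
        save_file(merged, os.path.join(out_dir, shard))  # flush this shard, then free it
        del merged

# Example with hypothetical paths:
# streaming_merge(["agents/math-70b", "agents/code-70b"], "merged-70b")
```

Because each merged shard is written to disk before the next one is processed, peak memory scales with shard size and agent count rather than with total model size, which is what makes the 70B+ entries in the table feasible on a single host.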