Does Sparsity Persist? Scaling Laws for RL-Induced Task Vectors and RAM Efficacy at 70B+

Investigating whether RL-induced task vector sparsity and Reinforced Agent Merging (RAM) advantages persist at massive scale (70B+ parameters) through scaling law analysis, architectural decomposition, and merging benchmarks.

Problem Statement

Reinforced Agent Merging (RAM) exploits the observation that RL fine-tuning produces sparse, heterogeneous task vectors. All prior RAM experiments used 3B-7B models. The open question: does sparsity persist at 70B+, and does RAM remain effective?

  • Sparsity scaling law: s(N) = 1 - a * N^(-b), with a = 0.085, b = 0.587 (R^2 = 0.965).
  • Predicted sparsity at 70B: >99%.
  • Sparsity increases with model scale.
  • RAM advantage grows with sparsity.
  • Streaming RAM reduces peak merge memory from 560 GB to 6 GB at 70B.

Sparsity Scaling Law (1B to 405B)

[Figures: L0 Sparsity vs. Model Size; Gini Coefficient vs. Model Size]

| Size (B) | L0 Sparsity | Gini  | Kurtosis | Top-1% Mass |
|---------:|------------:|------:|---------:|------------:|
| 1        | 0.914       | 0.927 | 793      | 0.400       |
| 3        | 0.961       | 0.958 | 1877     | 0.514       |
| 7        | 0.963       | 0.968 | 1301     | 0.566       |
| 14       | 0.978       | 0.979 | 2340     | 0.655       |
| 32       | 0.989       | 0.985 | 6966     | 0.740       |
| 70       | 0.994       | 0.991 | 13021    | 0.854       |
| 200      | 0.995       | 0.993 | 16121    | 0.915       |
| 405      | 0.997       | 0.994 | 39019    | 0.927       |
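
For concreteness, the power-law fit reported above can be reproduced from this table with an ordinary least-squares fit. A minimal sketch, assuming scipy's curve_fit; the array and variable names are illustrative, not from the source:

```python
import numpy as np
from scipy.optimize import curve_fit

# L0 sparsity by model size, taken from the table above.
size_b = np.array([1, 3, 7, 14, 32, 70, 200, 405], dtype=float)
l0_sparsity = np.array([0.914, 0.961, 0.963, 0.978, 0.989, 0.994, 0.995, 0.997])

def sparsity_law(n, a, b):
    # Power-law form s(N) = 1 - a * N^(-b).
    return 1.0 - a * n ** (-b)

(a_hat, b_hat), _ = curve_fit(sparsity_law, size_b, l0_sparsity, p0=(0.1, 0.5))

# Coefficient of determination and the fitted sparsity at 70B.
pred = sparsity_law(size_b, a_hat, b_hat)
ss_res = np.sum((l0_sparsity - pred) ** 2)
ss_tot = np.sum((l0_sparsity - l0_sparsity.mean()) ** 2)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, R^2 = {1 - ss_res / ss_tot:.3f}")
print(f"s(70) = {sparsity_law(70.0, a_hat, b_hat):.4f}")
```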

Merging Method Comparison (3 Agents)

[Figures: Cosine Similarity to Oracle by Sparsity; RAM Advantage Over Baselines; RAM Advantage Grows with Sparsity]

Merge Quality vs. Sparsity (3 Agents, Sweep from 60% to 97%)

At high sparsity (>=90%), RAM matches Simple Averaging (both achieve cosine = 1.0 with the oracle), while TIES and DARE degrade significantly. In other words, RAM's distribution-aware disentanglement preserves each agent's contribution when sparsity is high enough that non-overlapping (unique) update regions dominate.
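
To build intuition for why merge quality saturates at high sparsity, here is a toy sketch with synthetic sparse task vectors. It assumes the oracle is the element-wise sum of the agents' task vectors and uses a simple "keep unique entries, average overlaps" rule as a stand-in for distribution-aware disentanglement; it is not the published RAM, TIES, or DARE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_task_vector(dim, sparsity):
    """Random task vector with a (1 - sparsity) fraction of nonzero entries."""
    v = np.zeros(dim)
    nnz = int(dim * (1.0 - sparsity))
    idx = rng.choice(dim, size=nnz, replace=False)
    v[idx] = rng.normal(size=nnz)
    return v

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sparsity_aware_merge(vectors):
    """Keep entries touched by one agent as-is; average entries where agents overlap."""
    stacked = np.stack(vectors)
    counts = (stacked != 0).sum(axis=0)   # how many agents touch each entry
    summed = stacked.sum(axis=0)
    return np.where(counts > 0, summed / np.maximum(counts, 1), 0.0)

dim, agents = 100_000, 3
for sparsity in (0.60, 0.80, 0.90, 0.97):
    vecs = [sparse_task_vector(dim, sparsity) for _ in range(agents)]
    oracle = np.sum(vecs, axis=0)
    merged = sparsity_aware_merge(vecs)
    print(f"sparsity={sparsity:.2f}  cos(merged, oracle)={cosine(merged, oracle):.3f}")
```

As sparsity rises, overlaps between agents become rare, the merge approaches the oracle up to a scale factor, and the cosine similarity climbs toward 1.0.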

Layer-Wise Sparsity Anatomy (70B Model)

[Figures: Sparsity by Module Type; Inter-Agent Heterogeneity Across Scales]

| Module Type | Mean L0 | Std L0 | Count |
|-------------|--------:|-------:|------:|
| Attention   | 0.969   | 0.011  | 320   |
| MLP         | 0.945   | 0.015  | 240   |
| LayerNorm   | 0.880   | 0.035  | 161   |
| Embedding   | 0.974   | 0.000  | 1     |
| LM Head     | 0.918   | 0.000  | 1     |
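
A minimal sketch of how this per-module breakdown can be computed from a base and an RL-tuned state dict (dicts of torch tensors, e.g. from model.state_dict()). The module-name patterns and the tol threshold are assumptions, not the exact layout of any particular 70B checkpoint:

```python
from collections import defaultdict

def module_type(name: str) -> str:
    # Illustrative LLaMA-style name patterns; adjust for the actual checkpoint layout.
    if "embed" in name:
        return "Embedding"
    if "lm_head" in name:
        return "LM Head"
    if any(k in name for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return "Attention"
    if any(k in name for k in ("gate_proj", "up_proj", "down_proj")):
        return "MLP"
    if "norm" in name:
        return "LayerNorm"
    return "Other"

def l0_sparsity_by_module(base_state, tuned_state, tol=0.0):
    """Fraction of unchanged parameters (|delta| <= tol), grouped by module type."""
    per_type = defaultdict(list)
    for name, base in base_state.items():
        delta = tuned_state[name].float() - base.float()
        per_type[module_type(name)].append((delta.abs() <= tol).float().mean().item())
    return {t: {"mean_l0": sum(v) / len(v), "count": len(v)} for t, v in per_type.items()}
```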

Streaming RAM: Memory Efficiency

Peak Memory: Naive vs. Streaming (3 Agents)

| Model | FP16 (GB) | Naive Merge (GB) | Streaming (GB) | Reduction |
|-------|----------:|-----------------:|---------------:|----------:|
| 3B    | 6.0       | 24.0             | 2.6            | 9x        |
| 7B    | 14.0      | 56.0             | 3.0            | 19x       |
| 13B   | 26.0      | 104.0            | 3.6            | 29x       |
| 70B   | 140.0     | 560.0            | 6.0            | 93x       |
| 405B  | 810.0     | 3240.0           | 18.0           | 180x      |
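
A minimal sketch of the streaming idea, assuming safetensors-sharded checkpoints: each shard is merged one tensor key at a time across agents, so only the merged shard plus one tensor per agent is ever resident. Paths, merge_fn, and the shard layout are illustrative; this is not the reference Streaming RAM implementation.

```python
from contextlib import ExitStack
from safetensors import safe_open
from safetensors.torch import save_file

def stream_merge_shard(shard_name, agent_dirs, out_dir, merge_fn):
    """Merge one checkpoint shard across agents one tensor at a time."""
    merged = {}
    with ExitStack() as stack:
        handles = [stack.enter_context(safe_open(f"{d}/{shard_name}", framework="pt"))
                   for d in agent_dirs]
        for key in handles[0].keys():
            # Only this key's tensors from each agent are in memory at once.
            tensors = [h.get_tensor(key) for h in handles]
            merged[key] = merge_fn(tensors)
            del tensors
    save_file(merged, f"{out_dir}/{shard_name}")

def average(tensors):
    # Plain averaging as a stand-in for a sparsity-aware merge rule.
    return sum(t.float() for t in tensors) / len(tensors)
```

Looping this over shards bounds peak memory by roughly one merged shard plus one tensor per agent, rather than the full set of checkpoints required by a naive in-memory merge.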

Key Findings

Sparsity Persists & Grows

  • Scaling law s(N) = 1 - 0.085*N^(-0.587), R^2 = 0.965.
  • At 70B: >99% sparsity -- fewer than 1% of parameters receive meaningful RL updates.
  • Attention modules sparser than MLP; later layers sparser than earlier.
  • Inter-agent Jaccard similarity remains low (<0.20) across all scales (see the sketch below).
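
A toy sketch of the Jaccard computation over per-parameter update supports, i.e. |A ∩ B| / |A ∪ B| over the sets of parameters each agent actually changed; the mask and threshold conventions are assumptions:

```python
def update_masks(base_state, tuned_state, tol=0.0):
    """Boolean masks (per torch tensor) of parameters changed by more than tol."""
    return {name: (tuned_state[name].float() - p.float()).abs() > tol
            for name, p in base_state.items()}

def jaccard(masks_a, masks_b):
    """Jaccard similarity between two agents' changed-parameter sets."""
    inter = sum((masks_a[n] & masks_b[n]).sum().item() for n in masks_a)
    union = sum((masks_a[n] | masks_b[n]).sum().item() for n in masks_a)
    return inter / union if union else 0.0
```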

RAM Scales Effectively

  • RAM advantage grows monotonically with sparsity.
  • At 90%+ sparsity, RAM achieves a perfect cosine of 1.0 with the oracle.
  • DARE degrades severely at high sparsity (cosine drops to 0.15-0.26).
  • Streaming implementation: 93x memory reduction for 70B merging.