Investigating whether RL-induced task-vector sparsity and the advantages of Reinforced Agent Merging (RAM) persist at massive scale (70B+ parameters), via scaling-law analysis, architectural decomposition, and merging benchmarks.
Reinforced Agent Merging (RAM) exploits the observation that RL fine-tuning produces sparse, heterogeneous task vectors (the element-wise difference between fine-tuned and base weights). All prior RAM experiments used 3B-7B models, which leaves two open questions: does the sparsity persist at 70B+ parameters, and does RAM remain effective there?
RL task-vector sparsity statistics as a function of model size:
| Model Size (B) | L0 Sparsity | Gini | Kurtosis | Top-1% Mass |
|---|---|---|---|---|
| 1 | 0.914 | 0.927 | 793 | 0.400 |
| 3 | 0.961 | 0.958 | 1877 | 0.514 |
| 7 | 0.963 | 0.968 | 1301 | 0.566 |
| 14 | 0.978 | 0.979 | 2340 | 0.655 |
| 32 | 0.989 | 0.985 | 6966 | 0.740 |
| 70 | 0.994 | 0.991 | 13021 | 0.854 |
| 200 | 0.995 | 0.993 | 16121 | 0.915 |
| 405 | 0.997 | 0.994 | 39019 | 0.927 |
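A minimal sketch of how statistics like these could be computed for a single task vector, assuming the base and RL-fine-tuned checkpoints fit in host memory as PyTorch state dicts; the near-zero threshold `eps`, the function names, and the exact metric conventions (e.g., plain rather than excess kurtosis) are assumptions, not the reported setup.

```python
import torch

def task_vector(base_sd: dict, rl_sd: dict) -> torch.Tensor:
    """Flatten the element-wise difference between RL-tuned and base weights."""
    deltas = [(rl_sd[k] - base_sd[k]).flatten().float() for k in base_sd]
    return torch.cat(deltas)

def sparsity_stats(tau: torch.Tensor, eps: float = 1e-6) -> dict:
    """L0 sparsity, Gini coefficient, kurtosis, and top-1% mass of |tau|."""
    mag = tau.abs()
    l0_sparsity = (mag <= eps).float().mean().item()   # fraction of ~zero entries

    sorted_mag, _ = torch.sort(mag)                     # Gini over sorted magnitudes
    n = sorted_mag.numel()
    idx = torch.arange(1, n + 1, dtype=torch.float64)
    gini = ((2 * idx - n - 1) * sorted_mag.double()).sum() / (n * sorted_mag.double().sum())

    centered = tau - tau.mean()                         # heavy-tailedness of the delta distribution
    kurtosis = (centered ** 4).mean() / (centered ** 2).mean() ** 2

    k = max(1, int(0.01 * n))                           # mass held by the largest 1% of |tau|
    top1_mass = torch.topk(mag, k).values.sum() / mag.sum()

    return {"l0_sparsity": l0_sparsity,
            "gini": gini.item(),
            "kurtosis": kurtosis.item(),
            "top1pct_mass": top1_mass.item()}
```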
At high sparsity (>=90%), RAM matches Simple Averaging (both reach a cosine similarity of 1.0), while TIES and DARE degrade significantly. In other words, RAM's distribution-aware disentanglement fully preserves every agent's contribution once sparsity is high enough that the agents' unique, non-overlapping parameter regions dominate the merge.
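A toy illustration (not the RAM implementation) of why high sparsity collapses the differences between merging rules: when task vectors have nearly disjoint supports, even a plain element-wise average preserves each agent's contribution on its own support up to a scale factor.

```python
import torch

torch.manual_seed(0)
d, nnz = 10_000, 100                       # dimension, nonzeros per agent (99% sparse)

def sparse_task_vector(dim: int, k: int) -> torch.Tensor:
    """Random task vector with only k nonzero coordinates."""
    tau = torch.zeros(dim)
    tau[torch.randperm(dim)[:k]] = torch.randn(k)
    return tau

agents = [sparse_task_vector(d, nnz) for _ in range(3)]

# At ~99% sparsity the expected support overlap between agents is only a few
# coordinates, so the averaged merge is essentially each agent's vector (scaled)
# on that agent's own support.
merged = torch.stack(agents).mean(dim=0)

for i, tau in enumerate(agents):
    support = tau != 0
    cos = torch.nn.functional.cosine_similarity(merged[support], tau[support], dim=0)
    print(f"agent {i}: cosine(merge, agent) on its own support = {cos.item():.3f}")
```

With near-disjoint supports the printed similarities come out at roughly 0.99-1.0; pruning- or dropout-based schemes, by contrast, can discard exactly the unique coordinates that carry each agent's signal.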
Architectural decomposition of task-vector L0 sparsity by module type:
| Module Type | Mean L0 Sparsity | Std L0 Sparsity | Tensor Count |
|---|---|---|---|
| Attention | 0.969 | 0.011 | 320 |
| MLP | 0.945 | 0.015 | 240 |
| LayerNorm | 0.880 | 0.035 | 161 |
| Embedding | 0.974 | 0.000 | 1 |
| LM Head | 0.918 | 0.000 | 1 |
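One way this breakdown could be produced is by bucketing parameter names by module type; the name patterns below assume a Llama-style naming scheme, and the precomputed per-tensor L0 sparsity map is a hypothetical input.

```python
from collections import defaultdict
import statistics

def module_type(name: str) -> str:
    """Bucket a parameter name into a coarse module type (Llama-style names assumed)."""
    if any(p in name for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return "Attention"
    if any(p in name for p in ("gate_proj", "up_proj", "down_proj")):
        return "MLP"
    if "norm" in name:
        return "LayerNorm"
    if "embed_tokens" in name:
        return "Embedding"
    if "lm_head" in name:
        return "LM Head"
    return "Other"

def decompose(per_tensor_l0: dict[str, float]) -> dict[str, tuple[float, float, int]]:
    """Mean/std/count of L0 sparsity per module type, given per-tensor sparsities."""
    buckets = defaultdict(list)
    for name, l0 in per_tensor_l0.items():
        buckets[module_type(name)].append(l0)
    return {mtype: (statistics.mean(vals),
                    statistics.pstdev(vals) if len(vals) > 1 else 0.0,
                    len(vals))
            for mtype, vals in buckets.items()}
```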
Peak memory footprint of naive in-memory merging versus streaming (tensor-by-tensor) merging:
| Model | FP16 Weights (GB) | Naive Merge Peak (GB) | Streaming Merge Peak (GB) | Reduction vs Naive |
|---|---|---|---|---|
| 3B | 6.0 | 24.0 | 2.6 | 9x |
| 7B | 14.0 | 56.0 | 3.0 | 19x |
| 13B | 26.0 | 104.0 | 3.6 | 29x |
| 70B | 140.0 | 560.0 | 6.0 | 93x |
| 405B | 810.0 | 3240.0 | 18.0 | 180x |
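The streaming column assumes tensors are merged shard-by-shard and tensor-by-tensor rather than materializing every model in memory at once. A minimal sketch with `safetensors`, where plain averaging stands in for RAM's actual combination rule and identical shard/tensor layouts across agents are assumed:

```python
import os
from contextlib import ExitStack

import torch
from safetensors import safe_open
from safetensors.torch import save_file

def streaming_merge(agent_dirs: list[str], out_dir: str) -> None:
    """Merge sharded .safetensors checkpoints shard-by-shard: peak memory is roughly
    one shard's worth of merged tensors plus one tensor per agent, never the full
    set of models."""
    os.makedirs(out_dir, exist_ok=True)
    shards = sorted(f for f in os.listdir(agent_dirs[0]) if f.endswith(".safetensors"))
    for shard in shards:
        with ExitStack() as stack:
            # safe_open memory-maps each shard; tensors are materialized one at a time.
            handles = [stack.enter_context(
                           safe_open(os.path.join(d, shard), framework="pt", device="cpu"))
                       for d in agent_dirs]
            merged = {}
            for name in handles[0].keys():
                stacked = torch.stack([h.get_tensor(name) for h in handles])
                merged[name] = stacked.mean(dim=0)   # plain averaging; RAM's rule would slot in here
        save_file(merged, os.path.join(out_dir, shard))  # flush this shard, then free it
        del merged

# Example with hypothetical paths:
# streaming_merge(["agents/math-70b", "agents/code-70b"], "merged-70b")
```

Because each merged shard is written to disk before the next one is processed, peak memory scales with shard size and agent count rather than with total model size, which is what makes the 70B+ entries in the table feasible on a single host.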