Efficient Attention Mechanisms: Scalability vs Accuracy

Benchmarking five attention mechanisms across sequence lengths and tasks to map the Pareto frontier.

cs.CV · Transformers · 5 Mechanisms · N up to 16K
Headline accuracies: Softmax 0.951 · Linear 0.776 · Performer 0.812 · Sparse 0.889 · MHLA 0.872

Benchmark at N=4096

| Mechanism | Accuracy | Rel. Compute | Memory | Eff. Rank |
|-----------|----------|--------------|--------|-----------|
| Softmax   | 0.951    | 1.000        | O(N²)  | 0.847     |
| Linear    | 0.776    | 0.021        | O(N)   | 0.312     |
| Performer | 0.812    | 0.043        | O(N)   | 0.398     |
| Sparse    | 0.889    | 0.157        | O(N√N) | 0.634     |
| MHLA      | 0.872    | 0.084        | O(N)   | 0.589     |
MHLA reaches 91.7% of Softmax accuracy at only 8.4% of its compute cost, the highest accuracy among the linear-complexity methods and a strong point on the Pareto frontier. The accuracy gap across mechanisms correlates strongly with effective attention rank (r = 0.96).
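
The benchmark's exact definition of "Eff. Rank" isn't spelled out here; a minimal sketch, assuming an entropy-based effective rank of the attention map normalized to [0, 1], looks like this (the rank-r truncation is only an illustrative stand-in for a low-rank approximation):

```python
import numpy as np

def effective_rank(attn, eps=1e-12):
    """Normalized effective rank: exp(entropy of the singular-value
    distribution) divided by N, so 1.0 means an evenly spread,
    full-rank attention map and values near 0 mean a nearly rank-1 map."""
    s = np.linalg.svd(attn, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy)) / attn.shape[-1]

# Toy comparison: exact softmax attention vs. a crude rank-r surrogate.
rng = np.random.default_rng(0)
N, d, r = 256, 64, 16
q, k = rng.standard_normal((N, d)), rng.standard_normal((N, d))

scores = q @ k.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

u, s, vt = np.linalg.svd(attn)
low_rank = (u[:, :r] * s[:r]) @ vt[:r]   # truncated rank-r approximation

print(effective_rank(attn), effective_rank(low_rank))
```

Under a definition like this, kernelized or low-rank approximations necessarily score lower than exact softmax attention, which is consistent with the rank-accuracy correlation reported above.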

Scalability vs Accuracy Pareto Frontier

Points closer to the top-left corner represent better tradeoffs. MHLA (blue) achieves high accuracy at low compute, while Softmax (dark) has the highest accuracy but also the highest cost.
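
For readers who want to recompute the frontier from the N=4096 table, a small sketch in plain Python (numbers copied from the table above) is:

```python
# Pareto check for the N=4096 benchmark above (numbers copied from the table).
results = {
    "Softmax":   {"acc": 0.951, "compute": 1.000},
    "Linear":    {"acc": 0.776, "compute": 0.021},
    "Performer": {"acc": 0.812, "compute": 0.043},
    "Sparse":    {"acc": 0.889, "compute": 0.157},
    "MHLA":      {"acc": 0.872, "compute": 0.084},
}

def pareto_front(points):
    """A point is dominated if another point has accuracy >= and compute <=,
    with at least one strict inequality; everything else is on the frontier."""
    front = []
    for name, p in points.items():
        dominated = any(
            other != name
            and q["acc"] >= p["acc"]
            and q["compute"] <= p["compute"]
            and (q["acc"] > p["acc"] or q["compute"] < p["compute"])
            for other, q in points.items()
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(results))  # all five mechanisms are non-dominated with these numbers
```

With these numbers all five mechanisms are technically non-dominated in the accuracy vs compute plane; the narrower claim in the text is that MHLA gives up the least accuracy among the linear-complexity methods.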

Additional figure panels: Effective Rank vs Accuracy · Accuracy vs Sequence Length · Relative Compute vs Sequence Length · Accuracy Gap Sources

Method Complexity

| Mechanism | Time Complexity | Memory | Key Idea |
|-----------|-----------------|--------|----------|
| Softmax   | O(N²d)          | O(N²)  | Exact pairwise attention |
| Linear    | O(Nd²)          | O(N)   | Kernel decomposition |
| Performer | O(Nrd)          | O(N)   | Random feature approximation |
| Sparse    | O(N√N d)        | O(N√N) | Fixed stride + local window |
| MHLA      | O(Nhd)          | O(N)   | Token-level multi-head linear |
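
To make the O(N²d) vs O(Nd²) contrast in the table concrete, here is a minimal sketch (single head, no masking, NumPy only) of exact softmax attention against a kernelized linear attention; the elu(x) + 1 feature map is an assumed, commonly used choice, not necessarily the one benchmarked here.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Exact attention: materializes the N x N score matrix,
    so O(N^2 d) time and O(N^2) memory."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (N, N)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v                                # (N, d)

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention: softmax(q k^T) is replaced by phi(q) phi(k)^T.
    Reordering the matmuls gives O(N d^2) time and O(N) memory,
    since the N x N matrix is never formed."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 (assumed feature map)
    qf, kf = phi(q), phi(k)                              # (N, d)
    kv = kf.T @ v                                        # (d, d) key/value summary
    z = qf @ kf.sum(axis=0)                              # (N,) normalizer
    return (qf @ kv) / (z[:, None] + eps)                # (N, d)

rng = np.random.default_rng(0)
N, d = 1024, 64
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))
out_exact = softmax_attention(q, k, v)
out_linear = linear_attention(q, k, v)
print(out_exact.shape, out_linear.shape)  # both (1024, 64)
```

The reordering in `linear_attention` (contracting keys with values first) is the common thread behind the O(N)-memory rows in the table; Performer and MHLA differ mainly in how the feature map or heads are constructed, not in this overall computation order.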