Efficient Attention Mechanisms: Scalability vs Accuracy

Benchmarking five attention mechanisms across sequence lengths and tasks to map the Pareto frontier.

cs.CV · Transformers · 5 Mechanisms · N up to 16K
Headline accuracies: Softmax 0.951 · Linear 0.776 · Performer 0.812 · Sparse 0.889 · MHLA 0.872

Benchmark at N=4096

| Mechanism | Accuracy | Rel. Compute | Memory | Eff. Rank |
|-----------|----------|--------------|--------|-----------|
| Softmax   | 0.951    | 1.000        | O(N²)  | 0.847     |
| Linear    | 0.776    | 0.021        | O(N)   | 0.312     |
| Performer | 0.812    | 0.043        | O(N)   | 0.398     |
| Sparse    | 0.889    | 0.157        | O(N√N) | 0.634     |
| MHLA      | 0.872    | 0.084        | O(N)   | 0.589     |
MHLA reaches 91.7% of Softmax accuracy at only 8.4% of its compute cost, the highest accuracy among the linear-complexity methods and a strong point on the Pareto frontier. The accuracy gap across mechanisms correlates strongly with effective attention rank (r = 0.96).
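
The benchmark's exact definition of "Eff. Rank" isn't spelled out here; a minimal sketch, assuming an entropy-based effective rank of the attention map normalized to [0, 1], looks like this (the rank-r truncation is only an illustrative stand-in for a low-rank approximation):

```python
import numpy as np

def effective_rank(attn, eps=1e-12):
    """Normalized effective rank: exp(entropy of the singular-value
    distribution) divided by N, so 1.0 means an evenly spread,
    full-rank attention map and values near 0 mean a nearly rank-1 map."""
    s = np.linalg.svd(attn, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy)) / attn.shape[-1]

# Toy comparison: exact softmax attention vs. a crude rank-r surrogate.
rng = np.random.default_rng(0)
N, d, r = 256, 64, 16
q, k = rng.standard_normal((N, d)), rng.standard_normal((N, d))

scores = q @ k.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

u, s, vt = np.linalg.svd(attn)
low_rank = (u[:, :r] * s[:r]) @ vt[:r]   # truncated rank-r approximation

print(effective_rank(attn), effective_rank(low_rank))
```

Under a definition like this, kernelized or low-rank approximations necessarily score lower than exact softmax attention, which is consistent with the rank-accuracy correlation reported above.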

Scalability vs Accuracy Pareto Frontier

Points closer to the top-left corner represent better tradeoffs. MHLA (blue) achieves high accuracy at low compute, while Softmax (dark) has the highest accuracy but also the highest cost.
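
For readers who want to recompute the frontier from the N=4096 table, a small sketch in plain Python (numbers copied from the table above) is:

```python
# Pareto check for the N=4096 benchmark above (numbers copied from the table).
results = {
    "Softmax":   {"acc": 0.951, "compute": 1.000},
    "Linear":    {"acc": 0.776, "compute": 0.021},
    "Performer": {"acc": 0.812, "compute": 0.043},
    "Sparse":    {"acc": 0.889, "compute": 0.157},
    "MHLA":      {"acc": 0.872, "compute": 0.084},
}

def pareto_front(points):
    """A point is dominated if another point has accuracy >= and compute <=,
    with at least one strict inequality; everything else is on the frontier."""
    front = []
    for name, p in points.items():
        dominated = any(
            other != name
            and q["acc"] >= p["acc"]
            and q["compute"] <= p["compute"]
            and (q["acc"] > p["acc"] or q["compute"] < p["compute"])
            for other, q in points.items()
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(results))  # all five mechanisms are non-dominated with these numbers
```

With these numbers all five mechanisms are technically non-dominated in the accuracy vs compute plane; the narrower claim in the text is that MHLA gives up the least accuracy among the linear-complexity methods.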

Additional figure panels: Effective Rank vs Accuracy · Accuracy vs Sequence Length · Relative Compute vs Sequence Length · Accuracy Gap Sources

Method Complexity

| Mechanism | Time Complexity | Memory | Key Idea |
|-----------|-----------------|--------|----------|
| Softmax   | O(N²d)          | O(N²)  | Exact pairwise attention |
| Linear    | O(Nd²)          | O(N)   | Kernel decomposition |
| Performer | O(Nrd)          | O(N)   | Random feature approximation |
| Sparse    | O(N√N d)        | O(N√N) | Fixed stride + local window |
| MHLA      | O(Nhd)          | O(N)   | Token-level multi-head linear |
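
To make the O(N²d) vs O(Nd²) contrast in the table concrete, here is a minimal sketch (single head, no masking, NumPy only) of exact softmax attention against a kernelized linear attention; the elu(x) + 1 feature map is an assumed, commonly used choice, not necessarily the one benchmarked here.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Exact attention: materializes the N x N score matrix,
    so O(N^2 d) time and O(N^2) memory."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (N, N)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v                                # (N, d)

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention: softmax(q k^T) is replaced by phi(q) phi(k)^T.
    Reordering the matmuls gives O(N d^2) time and O(N) memory,
    since the N x N matrix is never formed."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 (assumed feature map)
    qf, kf = phi(q), phi(k)                              # (N, d)
    kv = kf.T @ v                                        # (d, d) key/value summary
    z = qf @ kf.sum(axis=0)                              # (N,) normalizer
    return (qf @ kv) / (z[:, None] + eps)                # (N, d)

rng = np.random.default_rng(0)
N, d = 1024, 64
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))
out_exact = softmax_attention(q, k, v)
out_linear = linear_attention(q, k, v)
print(out_exact.shape, out_linear.shape)  # both (1024, 64)
```

The reordering in `linear_attention` (contracting keys with values first) is the common thread behind the O(N)-memory rows in the table; Performer and MHLA differ mainly in how the feature map or heads are constructed, not in this overall computation order.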