Benchmarking five attention mechanisms across sequence lengths and tasks to map the Pareto frontier.
| Mechanism | Accuracy | Rel. Compute | Memory | Eff. Rank |
|---|---|---|---|---|
| Softmax | 0.951 | 1.000 | O(N2) | 0.847 |
| Linear | 0.776 | 0.021 | O(N) | 0.312 |
| Performer | 0.812 | 0.043 | O(N) | 0.398 |
| Sparse | 0.889 | 0.157 | O(N sqrt(N)) | 0.634 |
| MHLA | 0.872 | 0.084 | O(N) | 0.589 |
| Mechanism | Time Complexity | Memory | Key Idea |
|---|---|---|---|
| Softmax | O(N2d) | O(N2) | Exact pairwise attention |
| Linear | O(Nd2) | O(N) | Kernel decomposition |
| Performer | O(Nrd) | O(N) | Random feature approximation |
| Sparse | O(N sqrt(N) d) | O(N sqrt(N)) | Fixed stride + local window |
| MHLA | O(Nhd) | O(N) | Token-level multi-head linear |