A Principled Evaluation Framework for Mechanistic Interpretability Localization Methods
Mechanistic interpretability lacks unified benchmarks for comparing localization methods
Mechanistic interpretability (MI) aims to identify model components causally responsible for specific behaviors in neural networks. However, the field lacks unified benchmarks for comparing localization methods or verifying that identified components are causally optimal. Different methods often disagree, and there is no principled way to determine which identification is most accurate.
Zhang et al. (2026) identified that developing principled and reproducible evaluation frameworks remains an open challenge for MI.
The framework rests on three components:

- Five metrics combined via a weighted harmonic mean to prevent single-metric gaming (sketched below).
- Permutation testing to assess cross-method agreement beyond chance, without requiring ground truth.
- Synthetic models with known ground-truth circuits for objective evaluation.
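To make the aggregation concrete, here is a minimal sketch of a weighted-harmonic-mean composite in Python; the function name and dictionary API are illustrative, not the benchmark's actual interface. Because the harmonic mean is dominated by its smallest term, inflating one metric cannot compensate for a poor score on another, which is what blocks single-metric gaming. Plugging in Circuit Discovery's five metric values from the results table below reproduces its reported composite of 0.929.

```python
import numpy as np

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted harmonic mean of per-metric scores in [0, 1].

    The harmonic mean is dominated by its smallest input, so a high
    composite requires doing well on every metric simultaneously.
    """
    w = np.array([weights[m] for m in scores])
    s = np.array([scores[m] for m in scores])
    return w.sum() / np.sum(w / np.maximum(s, 1e-12))  # guard against zero metrics

# Circuit Discovery's metric values from the 4L/4H results table:
cd = {"faithfulness": 0.978, "completeness": 0.899,
      "minimality": 0.800, "stability": 1.000, "cos": 1.000}
equal_weights = {m: 0.2 for m in cd}
print(round(composite_score(cd, equal_weights), 3))  # -> 0.929
```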
Five complementary dimensions measuring localization quality
- Faithfulness: fraction of behavior preserved when only the identified components are active.
- Completeness: fraction of behavior destroyed when the identified components are ablated.
- Minimality: how selective the identification is, 1 - |S|/|C|, where S is the identified set and C the full component set.
- Stability: mean pairwise Jaccard similarity of identified sets across seed perturbations (see the sketch after this list).
- COS (causal optimality score): fraction of identified components surviving greedy subset reduction.
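Two of the five metrics are pure set arithmetic and can be sketched directly; faithfulness, completeness, and COS additionally require running the model under interventions, so they are omitted here. Variable names are illustrative.

```python
from itertools import combinations

def minimality(identified: set, all_components: set) -> float:
    """Minimality = 1 - |S|/|C|; higher means a more selective identification."""
    return 1.0 - len(identified) / len(all_components)

def stability(runs: list[set]) -> float:
    """Mean pairwise Jaccard similarity of the sets identified across seed perturbations."""
    pairs = list(combinations(runs, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Example: Ablation Scan's 16-of-20 identification gives minimality 0.200,
# matching the results table below.
print(minimality(set(range(16)), set(range(20))))  # -> 0.2
```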
Four methods evaluated on a 4-layer, 4-head synthetic transformer (20 components, 5 ground-truth)
- Activation Patching: measures the marginal behavior drop when each component is patched. 5 TP, 2 FP, 0 FN.
- Gradient Attribution: noisy gradient-based approximation; prone to false positives. 5 TP, 5 FP, 0 FN.
- Ablation Scan: systematic component removal; extensive over-identification. 5 TP, 11 FP, 0 FN.
- Circuit Discovery: greedy iterative pruning (sketched after this list); perfect precision. 4 TP, 0 FP, 1 FN.
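As a sketch of the greedy iterative pruning idea behind Circuit Discovery (the benchmark's exact procedure may differ; `behavior_score` and the tolerance are assumptions introduced here for illustration):

```python
def greedy_prune(components: set, behavior_score, tol: float = 0.05) -> set:
    """Greedy iterative pruning: repeatedly drop the cheapest-to-remove
    component while the surviving set still preserves behavior within
    `tol` of the full set.

    behavior_score(subset) -> fraction of the target behavior retained
    when only `subset` is active (assumed callable, in [0, 1]).
    """
    kept = set(components)
    target = behavior_score(kept) - tol
    while len(kept) > 1:
        # Score each candidate removal; pick the one that hurts behavior least.
        best_score, best_c = max(((behavior_score(kept - {c}), c) for c in kept),
                                 key=lambda t: t[0])
        if best_score < target:
            break  # every remaining component is necessary
        kept.remove(best_c)
    return kept
```

COS can then be read off as the fraction of a method's identified set that survives this reduction.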
Comprehensive multi-metric evaluation on the 4L/4H synthetic transformer benchmark
| Method | \|S\| | Faithfulness | Completeness | Minimality | Stability | COS | Precision | Recall | F1 | Composite |
|---|---|---|---|---|---|---|---|---|---|---|
| Act. Patching | 7 | 1.000 | 0.992 | 0.650 | 0.417 | 0.571 | 0.714 | 1.000 | 0.833 | 0.650 |
| Grad. Attribution | 10 | 1.000 | 0.977 | 0.500 | 0.533 | 0.400 | 0.500 | 1.000 | 0.667 | 0.595 |
| Ablation Scan | 16 | 1.000 | 0.980 | 0.200 | 0.505 | 0.250 | 0.313 | 1.000 | 0.476 | 0.385 |
| Circuit Discovery | 4 | 0.978 | 0.899 | 0.800 | 1.000 | 1.000 | 1.000 | 0.800 | 0.889 | 0.929 |
Radar chart showing the multi-dimensional evaluation of each method
Permutation testing reveals statistically significant cross-method agreement (z = 3.75, p = 0.001). The strictest consensus, components identified by every method, covers 4 of 5 ground-truth components, while the majority-vote set exactly recovers all 5.
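A permutation test of this kind can be sketched as follows; the agreement statistic (total pairwise intersection size) and the null model (uniform redraws of same-sized sets) are illustrative assumptions, not necessarily the benchmark's exact choices.

```python
import random

def agreement_z(method_sets: list[set], all_components: list,
                n_perm: int = 10_000, seed: int = 0):
    """Permutation test for cross-method agreement without ground truth.

    Observed statistic: total pairwise intersection size across methods.
    Null: each method's set is redrawn uniformly at random with the same
    cardinality. Returns (z-score, one-sided p-value).
    """
    rng = random.Random(seed)

    def stat(sets):
        return sum(len(a & b) for i, a in enumerate(sets) for b in sets[i + 1:])

    observed = stat(method_sets)
    null = [stat([set(rng.sample(all_components, len(s))) for s in method_sets])
            for _ in range(n_perm)]
    mean = sum(null) / n_perm
    sd = (sum((x - mean) ** 2 for x in null) / n_perm) ** 0.5
    p = sum(x >= observed for x in null) / n_perm
    return (observed - mean) / sd, p
```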
F1 scores and convergence z-scores across six architectural scales (6 to 156 components)
| Config | N | AP F1 | GA F1 | AS F1 | CD F1 | z-score | p-value |
|---|---|---|---|---|---|---|---|
| 2L/2H | 6 | 0.600 | 0.600 | 0.600 | 0.750 | 0.760 | 0.260 |
| 4L/4H | 20 | 0.833 | 0.667 | 0.476 | 0.889 | 4.710 | 0.000 |
| 6L/6H | 42 | 0.333 | 0.455 | 0.444 | 0.889 | 1.682 | 0.062 |
| 8L/8H | 72 | 0.213 | 0.286 | 0.189 | 0.889 | 2.325 | 0.014 |
| 10L/10H | 110 | 0.133 | 0.250 | 0.179 | 0.889 | 1.173 | 0.124 |
| 12L/12H | 156 | 0.071 | 0.175 | 0.111 | 0.889 | 2.606 | 0.002 |
How detection threshold affects performance metrics for different methods
Random subsets reveal the fundamental trade-off; ground truth and Circuit Discovery achieve near-optimal balance
Method ranking CD > AP > GA > AS preserved across all weight configurations
| Weights | AP | GA | AS | CD |
|---|---|---|---|---|
| Equal | 0.795 | 0.666 | 0.415 | 0.914 |
| Faith.-heavy | 0.829 | 0.725 | 0.500 | 0.933 |
| Minim.-heavy | 0.744 | 0.607 | 0.318 | 0.879 |
| COS-heavy | 0.713 | 0.559 | 0.349 | 0.946 |
| Faith.+Compl. | 0.840 | 0.733 | 0.496 | 0.931 |
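The sensitivity check amounts to recomputing the composite under alternative weight vectors and confirming that the ranking is unchanged. Reusing the `composite_score` sketch from above with the 4L/4H metric values, a minimal version looks like this; the weight vectors are illustrative, since the exact weights behind the table are not specified here, so the resulting scores need not match the table's values.

```python
metrics = {  # per-method scores from the 4L/4H results table
    "CD": {"faithfulness": 0.978, "completeness": 0.899, "minimality": 0.800,
           "stability": 1.000, "cos": 1.000},
    "AP": {"faithfulness": 1.000, "completeness": 0.992, "minimality": 0.650,
           "stability": 0.417, "cos": 0.571},
    "GA": {"faithfulness": 1.000, "completeness": 0.977, "minimality": 0.500,
           "stability": 0.533, "cos": 0.400},
    "AS": {"faithfulness": 1.000, "completeness": 0.980, "minimality": 0.200,
           "stability": 0.505, "cos": 0.250},
}
configs = {  # illustrative weightings, each summing to 1
    "Equal": dict.fromkeys(metrics["CD"], 0.20),
    "Faith.-heavy": {"faithfulness": 0.40, "completeness": 0.15,
                     "minimality": 0.15, "stability": 0.15, "cos": 0.15},
}
for name, w in configs.items():
    ranking = sorted(metrics, key=lambda m: composite_score(metrics[m], w), reverse=True)
    print(name, " > ".join(ranking))  # both configs yield CD > AP > GA > AS
```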
Principal results from the CAUSAL-BENCH evaluation
With a composite score of 0.929, Circuit Discovery leads through perfect stability (1.000), perfect COS (1.000), and highest minimality (0.800). It identifies only 4 components, all in the ground truth.
Methods that identify more components achieve higher faithfulness but lower minimality and causal optimality. This fundamental trade-off is systematically exposed by the multi-metric framework.
The majority-vote set exactly recovers all 5 ground-truth components (z = 3.75, p = 0.001), demonstrating that cross-method agreement is a reliable signal even without ground truth.
The ranking CD > AP > GA > AS is preserved across all five weight configurations, with composite scores for CD ranging from 0.879 to 0.946, supporting the default equal-weight configuration.
F1 for CD holds constant at 0.889 from 20 through 156 components (it is 0.750 on the smallest 6-component configuration), while threshold-based methods degrade sharply as the search space grows.
AP identifies all 5 ground-truth components (F1 = 0.833) with only 2 false positives, providing the best trade-off among methods with perfect recall.