CAUSAL-BENCH

A Principled Evaluation Framework for Mechanistic Interpretability Localization Methods

Track: Datasets | Category: CL Mechanistic Interpretability

At a glance: 4 methods evaluated · 5 evaluation metrics · 6 model scales · 0.929 best composite score

The Open Problem

Mechanistic interpretability lacks unified benchmarks for comparing localization methods

Problem Statement

Mechanistic interpretability (MI) aims to identify model components causally responsible for specific behaviors in neural networks. However, the field lacks unified benchmarks for comparing localization methods or verifying that identified components are causally optimal. Different methods often disagree, and there is no principled way to determine which identification is most accurate.

Zhang et al. (2026) identified that developing principled and reproducible evaluation frameworks remains an open challenge for MI.

Our Solution: Three Pillars

1. Multi-Metric Scoring: five metrics combined via a weighted harmonic mean to prevent single-metric gaming.

2. Cross-Method Convergence: permutation testing to assess agreement beyond chance, without ground truth.

3. Planted-Circuit Benchmarks: synthetic models with known ground-truth circuits for objective evaluation.

Evaluation Metrics

Five complementary dimensions measuring localization quality

  • F (Faithfulness): fraction of behavior preserved when only the identified components are active.
  • C (Completeness): fraction of behavior destroyed when the identified components are ablated.
  • M (Minimality): how selective the identification is, 1 - |S|/|C|, where S is the identified set and C the full set of components.
  • S (Stability): mean pairwise Jaccard similarity of identified sets across seed perturbations.
  • COS (Causal Optimality): fraction of identified components surviving greedy subset reduction.
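As a concrete reading of these definitions, here is a minimal sketch in Python. It assumes component sets are represented as plain Python sets of component identifiers; the benchmark's actual implementation is not shown in this document.

```python
# Minimal sketch of the set-based metrics above (illustrative only).

def minimality(identified, all_components):
    """Minimality = 1 - |S| / |C|: selectivity of the identified set S."""
    return 1 - len(identified) / len(all_components)

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two component sets."""
    return len(a & b) / len(a | b)

def stability(runs):
    """Mean pairwise Jaccard similarity across the identified sets
    produced under different seed perturbations."""
    pairs = [(runs[i], runs[j]) for i in range(len(runs))
             for j in range(i + 1, len(runs))]
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Circuit Discovery on the 4L/4H benchmark identifies 4 of 20 components:
print(minimality(set(range(4)), set(range(20))))  # 0.8, matching the results table
```

Faithfulness, Completeness, and COS additionally require a behavior measure on the model and so cannot be reduced to set arithmetic alone.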

Localization Methods

Four methods evaluated on a 4-layer, 4-head synthetic transformer (20 components, 5 ground-truth)

Activation Patching: 7 components identified. Measures the marginal behavior drop for each component. 5 TP, 2 FP, 0 FN.

Gradient Attribution: 10 components identified. Noisy gradient-based approximation, prone to false positives. 5 TP, 5 FP, 0 FN.

Ablation Scanning: 16 components identified. Systematic component removal with extensive over-identification. 5 TP, 11 FP, 0 FN.

Circuit Discovery: 4 components identified. Greedy iterative pruning with perfect precision. 4 TP, 0 FP, 1 FN.
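Greedy iterative pruning can be sketched as follows. Everything here is illustrative: the toy `behavior` function, in which only planted ground-truth components matter, stands in for the benchmark's real behavior measure, and the 0.8 tolerance is an assumed hyperparameter.

```python
# Toy sketch of greedy iterative pruning (illustrative, not the benchmark code).
GROUND_TRUTH = {0, 1, 2, 3, 4}  # 5 planted components out of 20

def behavior(components):
    """Toy behavior score: fraction of the planted circuit still present."""
    return len(components & GROUND_TRUTH) / len(GROUND_TRUTH)

def greedy_prune(all_components, tolerance=0.8):
    """Try dropping each component in turn; keep the drop whenever
    behavior stays at or above `tolerance`."""
    kept = set(all_components)
    for c in sorted(all_components):
        if behavior(kept - {c}) >= tolerance:
            kept -= {c}
    return kept

circuit = greedy_prune(range(20))
```

With this toy setup the pruner returns 4 of the 5 planted components: the first ground-truth component is also dropped because behavior at 0.8 still meets the tolerance. That mirrors the 4 TP / 0 FP / 1 FN pattern reported for Circuit Discovery above.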

Component Identification Map

Per-method map classifying each transformer component as a true positive, false positive, false negative, or true negative.

Main Evaluation Results

Comprehensive multi-metric evaluation on the 4L/4H synthetic transformer benchmark

Method            | |S| | Faithfulness | Completeness | Minimality | Stability | COS   | Precision | Recall | F1    | Composite
Act. Patching     | 7   | 1.000        | 0.992        | 0.650      | 0.417     | 0.571 | 0.714     | 1.000  | 0.833 | 0.650
Grad. Attribution | 10  | 1.000        | 0.977        | 0.500      | 0.533     | 0.400 | 0.500     | 1.000  | 0.667 | 0.595
Ablation Scan     | 16  | 1.000        | 0.980        | 0.200      | 0.505     | 0.250 | 0.313     | 1.000  | 0.476 | 0.385
Circuit Discovery | 4   | 0.978        | 0.899        | 0.800      | 1.000     | 1.000 | 1.000     | 0.800  | 0.889 | 0.929
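The Composite column is consistent with an equal-weight harmonic mean of the five core metrics. A quick check, assuming that combination rule (the framework describes the composite as a weighted harmonic mean):

```python
def harmonic_mean(values):
    """Unweighted harmonic mean: n / sum(1/x)."""
    return len(values) / sum(1 / v for v in values)

# (Faithfulness, Completeness, Minimality, Stability, COS) from the table.
metrics = {
    "Act. Patching":     (1.000, 0.992, 0.650, 0.417, 0.571),
    "Grad. Attribution": (1.000, 0.977, 0.500, 0.533, 0.400),
    "Ablation Scan":     (1.000, 0.980, 0.200, 0.505, 0.250),
    "Circuit Discovery": (0.978, 0.899, 0.800, 1.000, 1.000),
}
for name, vals in metrics.items():
    print(name, round(harmonic_mean(vals), 3))
# Act. Patching 0.650, Grad. Attribution 0.595,
# Ablation Scan 0.385, Circuit Discovery 0.929 -- matching the Composite column
```

The harmonic mean is what makes single-metric gaming hard: any one metric near zero (e.g. Ablation Scan's 0.200 minimality) drags the composite down sharply.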

Precision, Recall & F1

Components Identified vs Ground Truth (5)

Multi-Metric Profiles

Radar chart showing the multi-dimensional evaluation of each method

Metric Breakdown

Cross-Method Convergence

Permutation testing reveals statistically significant agreement (z = 3.75, p = 0.001)

Null Distribution vs Observed

Pairwise Jaccard Similarity

Convergence z-score: 3.75 (p = 0.001)

Consensus Set (All 4 methods)

  • L0.attn_head[0]
  • L0.attn_head[1]
  • L1.mlp
  • L2.attn_head[0]

4 of 5 ground-truth components

Majority Set (>2 methods)

  • L0.attn_head[0]
  • L0.attn_head[1]
  • L1.mlp
  • L2.attn_head[0]
  • L3.mlp

Exactly recovers all 5 ground-truth components
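The permutation test behind the convergence z-score can be sketched as follows: compare the observed mean pairwise Jaccard agreement of the methods' identified sets against a null distribution of size-matched random subsets. The identified sets below are hypothetical illustrations with the paper's set sizes (7, 10, 16, 4) and overlapping false positives; the benchmark's actual sets and test details are not reproduced here.

```python
import random
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(sets):
    pairs = list(combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def convergence_z(identified_sets, n_components, n_perms=1000, seed=0):
    """z-score of observed agreement vs. size-matched random subsets."""
    rng = random.Random(seed)
    observed = mean_pairwise_jaccard(identified_sets)
    null = []
    for _ in range(n_perms):
        rand_sets = [set(rng.sample(range(n_components), len(s)))
                     for s in identified_sets]
        null.append(mean_pairwise_jaccard(rand_sets))
    mu = sum(null) / n_perms
    sigma = (sum((x - mu) ** 2 for x in null) / n_perms) ** 0.5
    return (observed - mu) / sigma

# Hypothetical identified sets; ground truth is {0..4}.
sets = [{0, 1, 2, 3, 4, 5, 6},          # Activation Patching (7)
        {0, 1, 2, 3, 4, 5, 6, 7, 8, 9},  # Gradient Attribution (10)
        set(range(16)),                  # Ablation Scanning (16)
        {0, 1, 2, 3}]                    # Circuit Discovery (4)
z = convergence_z(sets, n_components=20)
```

Because larger subsets overlap substantially even by chance, comparing against size-matched random sets (rather than a fixed baseline) is what makes the z-score meaningful.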

Scalability Across Model Sizes

F1 scores and convergence z-scores across six architectural scales (6 to 156 components)

F1 Score vs Model Size

Convergence Z-Score vs Model Size

Scalability Results Table

Config  | Components | AP F1 | GA F1 | AS F1 | CD F1 | z-score | p-value
2L/2H   | 6          | 0.600 | 0.600 | 0.600 | 0.750 | 0.760   | 0.260
4L/4H   | 20         | 0.833 | 0.667 | 0.476 | 0.889 | 4.710   | 0.000
6L/6H   | 42         | 0.333 | 0.455 | 0.444 | 0.889 | 1.682   | 0.062
8L/8H   | 72         | 0.213 | 0.286 | 0.189 | 0.889 | 2.325   | 0.014
10L/10H | 110        | 0.133 | 0.250 | 0.179 | 0.889 | 1.173   | 0.124
12L/12H | 156        | 0.071 | 0.175 | 0.111 | 0.889 | 2.606   | 0.002

Threshold Sensitivity Analysis

How detection threshold affects performance metrics for different methods

Activation Patching: Metrics vs Threshold

Gradient Attribution: Metrics vs Threshold
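The threshold sweep for score-based methods can be illustrated with a toy example. The attribution scores below are made up for illustration (planted components scoring higher than the rest); only the sweep mechanics match the analysis above.

```python
# Toy threshold sweep for a score-based method such as activation patching.
GROUND_TRUTH = {0, 1, 2, 3, 4}
scores = {0: 0.90, 1: 0.80, 2: 0.70, 3: 0.60, 4: 0.50,     # planted components
          **{c: 0.45 - 0.02 * c for c in range(5, 20)}}     # background components

def f1_at_threshold(scores, truth, threshold):
    """Select every component scoring >= threshold, then score vs. ground truth."""
    picked = {c for c, s in scores.items() if s >= threshold}
    tp = len(picked & truth)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(picked), tp / len(truth)
    return 2 * precision * recall / (precision + recall)

for t in (0.1, 0.3, 0.5, 0.7):
    print(t, round(f1_at_threshold(scores, GROUND_TRUTH, t), 3))
```

Low thresholds admit many false positives (high recall, low precision); high thresholds miss planted components (the reverse). In this toy setting F1 peaks where the threshold separates planted from background scores.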

Faithfulness vs Minimality Trade-off

Random subsets reveal the fundamental trade-off; ground truth and Circuit Discovery achieve near-optimal balance

Composite Score Robustness

Method ranking CD > AP > GA > AS preserved across all weight configurations

Composite Scores by Weight Config

Weight Configurations

Weights       | AP    | GA    | AS    | CD
Equal         | 0.795 | 0.666 | 0.415 | 0.914
Faith.-heavy  | 0.829 | 0.725 | 0.500 | 0.933
Minim.-heavy  | 0.744 | 0.607 | 0.318 | 0.879
COS-heavy     | 0.713 | 0.559 | 0.349 | 0.946
Faith.+Compl. | 0.840 | 0.733 | 0.496 | 0.931
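The ranking-robustness check can be sketched with a weighted harmonic mean over the five metrics from the main results table. The weight vectors here are illustrative stand-ins (the document does not specify the exact weights behind each configuration), so the composite values will not reproduce the table above; the point is that the CD > AP > GA > AS ordering survives each reweighting.

```python
def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: sum(w) / sum(w/x)."""
    return sum(weights) / sum(w / v for v, w in zip(values, weights))

# (Faithfulness, Completeness, Minimality, Stability, COS) from the main table.
metrics = {
    "CD": (0.978, 0.899, 0.800, 1.000, 1.000),
    "AP": (1.000, 0.992, 0.650, 0.417, 0.571),
    "GA": (1.000, 0.977, 0.500, 0.533, 0.400),
    "AS": (1.000, 0.980, 0.200, 0.505, 0.250),
}
weight_configs = {                      # illustrative weight vectors
    "equal":       (1, 1, 1, 1, 1),
    "faith-heavy": (3, 1, 1, 1, 1),
    "minim-heavy": (1, 1, 3, 1, 1),
    "cos-heavy":   (1, 1, 1, 1, 3),
}
for name, w in weight_configs.items():
    ranked = sorted(metrics, key=lambda m: -weighted_harmonic_mean(metrics[m], w))
    print(name, ranked)   # CD > AP > GA > AS under every configuration
```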

Key Findings

Principal results from the CAUSAL-BENCH evaluation

Finding 1

Circuit Discovery Achieves Best Overall Score

With a composite score of 0.929, Circuit Discovery leads through perfect stability (1.000), perfect COS (1.000), and highest minimality (0.800). It identifies only 4 components, all in the ground truth.

Finding 2

Faithfulness-Minimality Trade-off

Methods that identify more components achieve higher faithfulness but lower minimality and causal optimality. This fundamental trade-off is systematically exposed by the multi-metric framework.

Finding 3

Cross-Method Convergence Works

The majority-vote set exactly recovers all 5 ground-truth components (z = 3.75, p = 0.001), demonstrating that cross-method agreement is a reliable signal even without ground truth.

Finding 4

Rankings Robust to Metric Weighting

The ranking CD > AP > GA > AS is preserved across all five weight configurations, with composite scores for CD ranging from 0.879 to 0.946, supporting the default equal-weight configuration.

Finding 5

Circuit Discovery Scales Best

F1 for CD holds constant at 0.889 for every configuration from 4L/4H upward (20-156 components), while threshold-based methods degrade sharply as the search space grows.

Finding 6

Activation Patching Offers Best Recall Balance

AP identifies all 5 ground-truth components (F1 = 0.833) with only 2 false positives, providing the best trade-off among methods with perfect recall.