Do Long Lean Proof Contexts Cause Failure on the Putnam 2025 A5 Key Lemma?

A systematic empirical investigation of how proof context length degrades LLM-based theorem proving performance. Through 2,700 controlled experiments, we show that context length strongly predicts failure, with a critical threshold at 8,192 tokens for the A5 key lemma.

  • 2,700 controlled experiments
  • ρ = −0.856 Spearman correlation between proof completion and context length
  • 8,192 tokens critical threshold for the A5 key lemma
  • 0.993 subagent proof completion rate

Problem & Methods

Investigating why LLM-based proof assistants fail on Putnam 2025 problem A5's key lemma, which requires showing that alternating permutations maximize a combinatorial counting function.

Problem Setting

Liu et al. (2026) reported that their Numina-Lean-Agent system repeatedly stalled on the A5 key lemma and conjectured that long proof contexts caused the difficulty. We test this hypothesis with a controlled factorial experiment:

  • Context length: 9 levels from 512 to 32,768 tokens
  • Lemma types: 5 types (A5 key, 2 auxiliary, generic algebra, induction)
  • Strategies: Monolithic vs. subagent decomposition (which caps effective context at 2,048 tokens)
  • Replication: 30 trials per cell (9 × 5 × 2 = 90 cells, 2,700 trials total; the full grid is enumerated in the sketch below)
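
A minimal sketch of the factorial grid, assuming nothing beyond the design above (the constant names are ours, not the authors'):

```python
from itertools import product

# Illustrative reconstruction of the 9 x 5 x 2 factorial design.
CONTEXT_LENGTHS = [512, 1_024, 2_048, 4_096, 8_192, 12_288, 16_384, 24_576, 32_768]
LEMMA_TYPES = ["a5_key", "a5_aux_1", "a5_aux_2", "generic_algebra", "structural_induction"]
STRATEGIES = ["monolithic", "subagent"]
TRIALS_PER_CELL = 30

cells = list(product(CONTEXT_LENGTHS, LEMMA_TYPES, STRATEGIES))
assert len(cells) == 90                       # 9 x 5 x 2 cells
assert len(cells) * TRIALS_PER_CELL == 2_700  # total controlled experiments
```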

Metrics & Model

Four primary metrics track different aspects of context-induced degradation (a per-trial record sketch follows the list):

  • Proof completion rate: Fraction of attempts that fully close the proof
  • Tactic accuracy: Fraction of tactics that are syntactically and semantically correct
  • Goal-focus score: [0,1] measure of whether the agent addresses the correct subgoal
  • Stall count: Number of repetitive-loop events without progress
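
These metrics map naturally onto one record per trial. A minimal sketch, with field names that are ours rather than the authors':

```python
from dataclasses import dataclass

@dataclass
class TrialRecord:
    """One of the 2,700 trial outcomes; illustrative field names."""
    context_tokens: int     # one of the 9 levels, 512 .. 32,768
    lemma_type: str         # e.g. "a5_key", "generic_algebra"
    strategy: str           # "monolithic" or "subagent"
    completed: bool         # did the attempt fully close the proof?
    tactic_accuracy: float  # fraction of correct tactics, in [0, 1]
    goal_focus: float       # subgoal-targeting score, in [0, 1]
    stall_count: int        # repetitive-loop events without progress
    confidence: float       # agent's self-reported confidence, in [0, 1]
```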

Context degradation is well described by a sigmoid-modulated exponential decay model with a fitted critical threshold of L_crit ≈ 8,000 tokens, matching the observed collapse at the 8,192-token grid point.
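
The page names the model family but not its closed form. One natural parameterization consistent with that name (the form itself and the symbols τ and s are our assumptions) gates an exponential decay with a logistic switch at the threshold:

```latex
% Sketch only: the page does not publish the fitted equation.
% A(L): metric value at context length L; A_0: baseline value;
% tau: background decay scale; s: sharpness of the collapse at L_crit.
A(L) = A_0 \, e^{-L/\tau} \, \sigma\!\left(\frac{L_{\mathrm{crit}} - L}{s}\right),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}},
\qquad L_{\mathrm{crit}} \approx 8{,}000
```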

Interactive Results

Headline results from each of the five experimental charts.

Tactic Accuracy vs. Context Length (Monolithic Strategy)

Both the A5 key lemma and generic algebra degrade sharply with context length: Spearman ρ = −0.943 (p < 10⁻¹⁰).

Proof Completion Rate vs. Context Length

The A5 key lemma collapses to 0% completion at 8,192 tokens under the monolithic strategy, while the subagent strategy maintains near-perfect completion across all context lengths.

Calibration Gap: Confidence vs. Actual Accuracy

Agent confidence never drops below 0.919 even as actual accuracy falls to 0.0, revealing a severe calibration failure. The gap reaches 0.919 at 32,768 tokens.

Stalling Behavior vs. Context Length

Mean stall count rises from 0.35 at 512 tokens to 8.94 at 32,768 tokens. The stall rate reaches 100% at 24,576 tokens.

Monolithic vs. Subagent Strategy by Lemma Type

The subagent decomposition strategy, which caps effective context at 2,048 tokens, dramatically improves all metrics across all lemma types.
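
The page does not describe the decomposition loop itself; the toy sketch below illustrates the core idea, namely that each subgoal is attempted in its own bounded context. `split_into_subgoals` and `attempt_proof` are placeholder stand-ins, not the Numina-Lean-Agent API:

```python
# Toy illustration of the subagent strategy: every subgoal is proved in its
# own bounded context, so effective context never grows with proof length.

MAX_SUBAGENT_CONTEXT = 2_048  # token cap reported for the subagent strategy

def split_into_subgoals(lemma: str) -> list[str]:
    # Toy splitter: treat a conjunction as independent subgoals.
    return [part.strip() for part in lemma.split("AND")]

def attempt_proof(goal: str, context_cap: int) -> str | None:
    # Toy prover: always "succeeds"; a real agent would call Lean here.
    return f"proof-term-for({goal}) within {context_cap} tokens"

def prove_with_subagents(lemma: str) -> list[str] | None:
    proofs = []
    for subgoal in split_into_subgoals(lemma):
        # Each subagent sees only its subgoal plus local context,
        # keeping its context under MAX_SUBAGENT_CONTEXT.
        proof = attempt_proof(subgoal, context_cap=MAX_SUBAGENT_CONTEXT)
        if proof is None:
            return None  # failure handling / fallback would go here
        proofs.append(proof)
    return proofs

print(prove_with_subagents("step_1 AND step_2 AND step_3"))
```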

Detailed Data

Full experimental measurements across context lengths and strategies.

A5 Key Lemma: Monolithic Strategy by Context Length

Context (tokens)   Completion Rate   Tactic Accuracy   Goal Focus   Stall Count   Confidence
512                1.000             0.550             0.768        0.667         0.942
1,024              1.000             0.530             0.753        0.533         0.943
2,048              1.000             0.469             0.738        0.867         0.944
4,096              0.833             0.356             0.688        1.033         0.942
8,192              0.000             0.180             0.614        1.333         0.937
12,288             0.000             0.066             0.547        1.600         0.938
16,384             0.000             0.020             0.493        3.600         0.935
24,576             0.000             0.002             0.398        7.200         0.931
32,768             0.000             0.000             0.312        8.567         0.921

Strategy Comparison Across Lemma Types

Lemma Type             Mono. Completion   Sub. Completion   Mono. Accuracy   Sub. Accuracy   Mono. Focus   Sub. Focus   Mono. Stalls   Sub. Stalls
A5 Key Lemma           0.426              0.993             0.241            0.473           0.590         0.739        2.822          0.778
A5 Auxiliary 1         0.541              1.000             0.340            0.694           0.677         0.858        2.652          0.356
A5 Auxiliary 2         0.522              1.000             0.330            0.681           0.677         0.856        2.544          0.348
Generic Algebra        0.567              1.000             0.347            0.696           0.681         0.862        2.533          0.226
Structural Induction   0.482              1.000             0.331            0.670           0.674         0.849        2.500          0.407

Calibration Data (All Lemmas, Monolithic)

Context (tokens)   Mean Confidence   Mean Accuracy   Calibration Gap   Mean Stalls   Stall Rate
512                0.944             0.715           0.229             0.353         0.280
1,024              0.944             0.681           0.264             0.413         0.307
2,048              0.945             0.630           0.316             0.433         0.333
4,096              0.943             0.494           0.449             0.513         0.393
8,192              0.940             0.236           0.704             0.953         0.580
12,288             0.939             0.081           0.858             1.620         0.800
16,384             0.936             0.022           0.914             3.513         0.947
24,576             0.931             0.001           0.931             6.753         1.000
32,768             0.919             0.000           0.919             8.940         1.000
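
The gap column is simply the difference of the two preceding columns; a one-liner reproduces it. The ±0.001 mismatches at 1,024, 2,048, and 24,576 tokens come from rounding in the published columns:

```python
# Calibration gap = mean confidence - mean accuracy, per context length.
confidence = [0.944, 0.944, 0.945, 0.943, 0.940, 0.939, 0.936, 0.931, 0.919]
accuracy   = [0.715, 0.681, 0.630, 0.494, 0.236, 0.081, 0.022, 0.001, 0.000]

gaps = [round(c - a, 3) for c, a in zip(confidence, accuracy)]
print(gaps)  # [0.229, 0.263, 0.315, 0.449, 0.704, 0.858, 0.914, 0.93, 0.919]
```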

Key Findings

Four major contributions from 2,700 controlled experiments.

1. Context Length Predicts Failure

ρ = −0.856

Spearman rank correlation between context length and proof completion rate (p < 10⁻¹⁰). Strong monotonic negative relationship confirmed across all lemma types.
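
As a quick sanity check, recomputing Spearman's ρ on the nine aggregated A5 completion means from the detailed-data table gives a similar value; the published −0.856 is computed over all 2,700 per-trial outcomes, so the two numbers differ slightly:

```python
from scipy.stats import spearmanr

# A5 key lemma, monolithic strategy: cell means from the detailed-data table.
contexts   = [512, 1_024, 2_048, 4_096, 8_192, 12_288, 16_384, 24_576, 32_768]
completion = [1.000, 1.000, 1.000, 0.833, 0.000, 0.000, 0.000, 0.000, 0.000]

rho, p = spearmanr(contexts, completion)
print(round(rho, 3))  # -0.894 on aggregated means; per-trial rho is -0.856
```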

2. Critical Threshold Identified

8,192 tokens

The A5 key lemma completion drops from 100% at 2,048 tokens (83.3% at 4,096) to 0% at 8,192 tokens. Generic lemmas collapse later, near 12,288 tokens, consistent with an interaction between lemma complexity and context length.

3. Subagent Strategy Works

+56.7 pp

Subagent decomposition (capping context at 2,048 tokens) raises key lemma completion from 42.6% to 99.3%. The improvement is consistent across all five lemma types.

4. Severe Calibration Gap

0.919 gap

Agent confidence stays at or above 0.919 across all context lengths while accuracy reaches 0.0 at 32,768 tokens. The model cannot self-diagnose context-induced failure, undermining confidence-based fallback strategies.