A systematic empirical investigation of how proof context length degrades LLM-based theorem proving performance. Through 2,700 controlled experiments, we show that context length strongly predicts failure, with a critical threshold at 8,192 tokens for the A5 key lemma.
Investigating why LLM-based proof assistants fail on Putnam 2025 problem A5's key lemma, which requires showing that alternating permutations maximize a combinatorial counting function.
Liu et al. (2026) reported that their Numina-Lean-Agent system repeatedly stalled on the A5 key lemma and conjectured that long proof contexts caused the difficulty. We test this hypothesis with a controlled factorial experiment: nine context lengths (512 to 32,768 tokens) × five lemma types × two proving strategies (monolithic vs. subagent), with 30 trials per cell for 2,700 runs in total.
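As a sketch, the 2,700 runs correspond to the full factorial grid below (names and structure are illustrative, not the authors' code; the factor levels are taken from the tables in this report):

```python
from itertools import product

# Factor levels taken from the measurement tables in this report.
CONTEXT_LENGTHS = [512, 1024, 2048, 4096, 8192, 12288, 16384, 24576, 32768]
LEMMA_TYPES = ["a5_key", "a5_aux1", "a5_aux2", "generic_algebra", "structural_induction"]
STRATEGIES = ["monolithic", "subagent"]
TRIALS_PER_CELL = 30  # 9 lengths x 5 lemmas x 2 strategies x 30 trials = 2,700 runs

runs = [
    {"context": c, "lemma": l, "strategy": s, "trial": t}
    for c, l, s, t in product(CONTEXT_LENGTHS, LEMMA_TYPES, STRATEGIES, range(TRIALS_PER_CELL))
]
assert len(runs) == 2700
```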
Four primary metrics track different aspects of context-induced degradation: proof completion rate, tactic accuracy, goal focus, and mean stall count; agent self-reported confidence is recorded alongside them for the calibration analysis.
Context degradation follows a sigmoid-modulated exponential decay model with critical threshold L_crit = 8,000 tokens.
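The source does not spell out the parameterization; one plausible form matching the description, with $P_0$, $\lambda$, and $\tau$ as free parameters and $\sigma$ the logistic function, is

$$P(L) = P_0 \, e^{-\lambda L} \, \sigma\!\left(\frac{L_{\mathrm{crit}} - L}{\tau}\right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad L_{\mathrm{crit}} = 8{,}000 \text{ tokens}.$$

The sigmoid gate reproduces the sharp collapse near $L_{\mathrm{crit}}$, while the exponential term captures the gradual pre-threshold decline seen in the completion data.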
Both the A5 key lemma and generic algebra lemmas degrade sharply with context length: Spearman ρ = -0.943 (p < 10⁻¹⁰).
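For reference, a minimal sketch of the correlation computation, using the aggregated per-context completion means from the first table below (the reported ρ = -0.943 was computed over individual trials, so the nine tied aggregate points give a slightly different value):

```python
from scipy.stats import spearmanr

# Per-context completion rates for the A5 key lemma under the monolithic
# strategy, transcribed from the measurements table below.
contexts = [512, 1024, 2048, 4096, 8192, 12288, 16384, 24576, 32768]
completion = [1.000, 1.000, 1.000, 0.833, 0.000, 0.000, 0.000, 0.000, 0.000]

rho, p = spearmanr(contexts, completion)
print(f"rho = {rho:.3f}, p = {p:.1e}")  # rho ~ -0.894 here: ties at 0 and 1 soften it
```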
The A5 key lemma collapses to 0% completion at 8,192 tokens under the monolithic strategy, while the subagent strategy maintains near-perfect completion across all context lengths.
Agent confidence remains at or above 0.919 even as actual accuracy falls to 0.0, revealing a severe calibration failure. The gap reaches 0.919 at 32,768 tokens.
Mean stall count rises from 0.35 at 512 tokens to 8.94 at 32,768 tokens. The stall rate reaches 100% at 24,576 tokens.
The subagent decomposition strategy, which caps effective context at 2,048 tokens, dramatically improves all metrics across all lemma types.
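A minimal sketch of this orchestration pattern (the `decompose`, `prove`, and `summarize` interfaces are hypothetical; the source specifies only the 2,048-token cap and the decomposition into sub-lemmas):

```python
MAX_SUBAGENT_CONTEXT = 2048  # token cap per subagent, from the experimental design

def prove_with_subagents(lemma, decompose, prove, summarize):
    """Hypothetical orchestration loop: split the lemma into sub-goals small
    enough that each subagent's prompt stays under the 2,048-token cap,
    prove each in a fresh context, then combine the proofs."""
    proofs = []
    for subgoal in decompose(lemma, max_tokens=MAX_SUBAGENT_CONTEXT):
        # Each subagent sees only its own sub-goal plus a short summary of
        # already-proved facts, never the full accumulated proof context.
        context = summarize(proofs, budget=MAX_SUBAGENT_CONTEXT)
        proofs.append(prove(subgoal, context))
    return proofs
```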
Full experimental measurements across context lengths and strategies.
A5 key lemma under the monolithic strategy, by context length:

| Context (tokens) | Completion Rate | Tactic Accuracy | Goal Focus | Mean Stall Count | Mean Confidence |
|---|---|---|---|---|---|
| 512 | 1.000 | 0.550 | 0.768 | 0.667 | 0.942 |
| 1,024 | 1.000 | 0.530 | 0.753 | 0.533 | 0.943 |
| 2,048 | 1.000 | 0.469 | 0.738 | 0.867 | 0.944 |
| 4,096 | 0.833 | 0.356 | 0.688 | 1.033 | 0.942 |
| 8,192 | 0.000 | 0.180 | 0.614 | 1.333 | 0.937 |
| 12,288 | 0.000 | 0.066 | 0.547 | 1.600 | 0.938 |
| 16,384 | 0.000 | 0.020 | 0.493 | 3.600 | 0.935 |
| 24,576 | 0.000 | 0.002 | 0.398 | 7.200 | 0.931 |
| 32,768 | 0.000 | 0.000 | 0.312 | 8.567 | 0.921 |

Monolithic vs. subagent strategy, averaged across context lengths, by lemma type:

| Lemma Type | Mono. Completion | Sub. Completion | Mono. Accuracy | Sub. Accuracy | Mono. Focus | Sub. Focus | Mono. Stalls | Sub. Stalls |
|---|---|---|---|---|---|---|---|---|
| A5 Key Lemma | 0.426 | 0.993 | 0.241 | 0.473 | 0.590 | 0.739 | 2.822 | 0.778 |
| A5 Auxiliary 1 | 0.541 | 1.000 | 0.340 | 0.694 | 0.677 | 0.858 | 2.652 | 0.356 |
| A5 Auxiliary 2 | 0.522 | 1.000 | 0.330 | 0.681 | 0.677 | 0.856 | 2.544 | 0.348 |
| Generic Algebra | 0.567 | 1.000 | 0.347 | 0.696 | 0.681 | 0.862 | 2.533 | 0.226 |
| Structural Induction | 0.482 | 1.000 | 0.331 | 0.670 | 0.674 | 0.849 | 2.500 | 0.407 |

Calibration and stalling under the monolithic strategy, averaged across the five lemma types, by context length:

| Context (tokens) | Mean Confidence | Mean Accuracy | Calibration Gap | Mean Stalls | Stall Rate |
|---|---|---|---|---|---|
| 512 | 0.944 | 0.715 | 0.229 | 0.353 | 0.280 |
| 1,024 | 0.944 | 0.681 | 0.264 | 0.413 | 0.307 |
| 2,048 | 0.945 | 0.630 | 0.316 | 0.433 | 0.333 |
| 4,096 | 0.943 | 0.494 | 0.449 | 0.513 | 0.393 |
| 8,192 | 0.940 | 0.236 | 0.704 | 0.953 | 0.580 |
| 12,288 | 0.939 | 0.081 | 0.858 | 1.620 | 0.800 |
| 16,384 | 0.936 | 0.022 | 0.914 | 3.513 | 0.947 |
| 24,576 | 0.931 | 0.001 | 0.931 | 6.753 | 1.000 |
| 32,768 | 0.919 | 0.000 | 0.919 | 8.940 | 1.000 |
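The calibration-gap column is simply confidence minus accuracy; a quick check against the table rows (values transcribed from above):

```python
# (context, mean confidence, mean accuracy) rows from the calibration table above
rows = [
    (512, 0.944, 0.715), (1024, 0.944, 0.681), (2048, 0.945, 0.630),
    (4096, 0.943, 0.494), (8192, 0.940, 0.236), (12288, 0.939, 0.081),
    (16384, 0.936, 0.022), (24576, 0.931, 0.001), (32768, 0.919, 0.000),
]
for ctx, conf, acc in rows:
    gap = conf - acc  # calibration gap: overconfidence relative to actual accuracy
    print(f"{ctx:>6} tokens: gap = {gap:.3f}")
# The gap grows from 0.229 at 512 tokens to 0.919 at 32,768 tokens,
# while confidence itself barely moves, so the model cannot sense the failure.
```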
Four major contributions from 2,700 controlled experiments.
Spearman rank correlation between context length and proof completion rate: ρ = -0.943 (p < 10⁻¹⁰). A strong monotonic negative relationship holds across all lemma types.
A5 key lemma completion drops from 100% at 2,048 tokens to 0% at 8,192 tokens. Generic lemmas show a later threshold near 12,288 tokens, confirming an interaction between lemma complexity and context length (see the threshold-detection sketch after this list).
Subagent decomposition (capping context at 2,048 tokens) raises key lemma completion from 42.6% to 99.3%. The improvement is consistent across all five lemma types.
Agent confidence stays at or above 0.919 while accuracy reaches 0.0 at 32,768 tokens. The model cannot self-diagnose context-induced failure, undermining confidence-based fallback strategies.
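A small helper illustrating how the collapse thresholds can be read off the completion curves (a sketch; the source does not state its exact detection rule):

```python
def collapse_threshold(contexts, completion_rates, floor=0.0):
    """Return the first context length at which completion falls to `floor`."""
    for ctx, rate in sorted(zip(contexts, completion_rates)):
        if rate <= floor:
            return ctx
    return None

# A5 key lemma (monolithic) completion from the first table: collapses at 8,192.
print(collapse_threshold(
    [512, 1024, 2048, 4096, 8192, 12288, 16384, 24576, 32768],
    [1.000, 1.000, 1.000, 0.833, 0.000, 0.000, 0.000, 0.000, 0.000]))  # -> 8192
```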