When Does Visual Chain-of-Thought Break Through?

The Open Problem

Large language models have reached near-saturation on standard math benchmarks using text-only chain-of-thought reasoning. Wu et al. (2026) ask: can interleaving visual generation into verbal reasoning fundamentally surpass these limits? They note that mathematical symbolism is already "largely complete," making this an open question.

We address this through a simulation framework that isolates when and where visual intermediate representations provide value beyond what equivalent text-only compute provides.

+18.3 pp

Max lift over compute-matched scaling

In Euclidean geometry at chain length 20, visual checkpoints outperform best-of-N by 18.3 percentage points.

+2.3 pp

Lift in algebra (same conditions)

In purely algebraic domains, visual CoT provides only marginal improvement that is within noise at longer chains.

r = 0.96

Correlation: domain effectiveness vs. lift

The visual advantage is almost perfectly predicted by how well a vision module can verify mathematical state in each domain.

10 Domains

400 synthetic problems analyzed

Our taxonomy spans algebra through topology with domain-calibrated feature distributions across 5 difficulty levels.

Methodology

Step 1

VBP Taxonomy

→

Step 2
Error Model

→

Step 3

Visual Checkpoints

→

Step 4

Compare Strategies

Visual Benefit Potential (VBP)

VBP = (0.6 · S + 0.4 · W) · (1 − 0.7 · R)

where S = spatial complexity, W = working-memory pressure, R = symbolic reducibility. Higher VBP predicts greater benefit from visual CoT.

Error Propagation

p_i = p₀ + α · c_i + β · i + γ · e_i

Per-step error probability depends on base rate (p₀=0.03), state complexity (c_i), chain depth (i), and accumulated undetected errors (e_i). Visual checkpoints detect and correct errors with domain-dependent effectiveness.

Three Strategies Compared

Strategy	Description	Compute
Text-Only CoT	Sequential derivation with no checkpoints	n steps
Visual Checkpoint	Checkpoints every K steps; detect + correct errors	n + 3(n/K) steps
Best-of-N	N independent text-only chains, oracle selection	N · n steps (compute-matched)

Results

Visual Benefit Potential by Domain

Mean VBP score across 400 problems. High-VBP domains (red) are predicted to benefit most from visual CoT. Low-VBP domains (light blue) have highly symbolic representations where text suffices.

Domain:

Accuracy vs. Chain Length

Comparing text-only, visual-checkpoint, and compute-matched best-of-N strategies. Use the dropdown to switch domains.

Visual CoT Advantage Across Domains (chain=20)

Accuracy lift of visual CoT over text-only baseline (red) and over compute-matched best-of-N (orange). Domains sorted by lift magnitude.

Domain	Effectiveness	Baseline	Visual	Best-of-N	Lift (BoN)	Breaks Through?

Parameter:

Sensitivity Analysis (Euclidean Geometry, chain=20)

Sweeping model parameters to test robustness. Visual CoT consistently outperforms text-only across all parameter values.

Conclusion

Our simulation framework yields a domain-dependent answer to whether multimodal interleaved CoT can break through mathematical performance limits:

Spatial Domains Benefit

In Euclidean geometry, graph theory, and topology (VBP > 0.30), visual checkpoints provide 10--18 pp accuracy lifts over compute-equivalent scaling. This is a fundamental advantage.

Symbolic Domains Do Not

In algebra, number theory, and calculus (VBP < 0.10), visual CoT provides less than 3 pp of lift. The skeptical prior is confirmed for these domains.

Chain Length Amplifies

The advantage grows with derivation depth (peaking at 10--30 steps), because visual checkpoints interrupt error compounding that text-only scaling cannot address.

Answer: Conditional Yes

Multimodal interleaved CoT CAN break through performance limits, but only in domains with inherent spatial structure. The breakthrough is real but domain-specific, not universal.