A simulation study of multimodal interleaved reasoning in mathematical problem solving
Large language models have reached near-saturation on standard math benchmarks using text-only chain-of-thought reasoning. Wu et al. (2026) ask whether interleaving visual generation into verbal reasoning can fundamentally surpass these limits, noting that mathematical symbolism is already "largely complete," which makes this an open question.
We address this question with a simulation framework that isolates when and where visual intermediate representations add value beyond equivalent text-only compute.
We score each problem's visual benefit potential (VBP) from three factors: spatial complexity S, working-memory pressure W, and symbolic reducibility R. VBP rises with S and W and falls with R; higher VBP predicts greater benefit from visual CoT.
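As a concrete illustration, the sketch below scores VBP with the multiplicative form S · W · (1 − R); this functional form and the [0, 1] factor ranges are our assumptions, since the text fixes only the three factors and the direction of their effect.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    spatial_complexity: float     # S, assumed in [0, 1]
    wm_pressure: float            # W, assumed in [0, 1]
    symbolic_reducibility: float  # R, assumed in [0, 1]

def vbp(p: Problem) -> float:
    """Visual benefit potential: increasing in S and W, decreasing in R.

    The multiplicative form S * W * (1 - R) is an illustrative assumption,
    chosen only to respect the stated monotonicity.
    """
    return p.spatial_complexity * p.wm_pressure * (1.0 - p.symbolic_reducibility)

# A spatial geometry problem scores far above a purely symbolic one.
geometry = Problem(spatial_complexity=0.9, wm_pressure=0.7, symbolic_reducibility=0.2)
algebra = Problem(spatial_complexity=0.1, wm_pressure=0.4, symbolic_reducibility=0.9)
print(vbp(geometry), vbp(algebra))  # ~0.504 vs. ~0.004
```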
Per-step error probability depends on the base rate (p_0 = 0.03), state complexity (c_i), chain depth (i), and the count of accumulated undetected errors (e_i). Visual checkpoints detect and correct errors with domain-dependent effectiveness.
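A minimal sketch of one way to realize this error model, assuming a multiplicative combination rule; the depth and error-feedback coefficients (alpha, beta) and the per-checkpoint correction rule are illustrative choices, with only p_0 = 0.03 taken from the text.

```python
import random

P0 = 0.03  # base per-step error rate, as given

def step_error_prob(i: int, c_i: float, e_i: int,
                    alpha: float = 0.05, beta: float = 0.25) -> float:
    """Error probability at step i: grows with state complexity c_i,
    chain depth i, and accumulated undetected errors e_i.
    The multiplicative form and the alpha/beta values are assumptions."""
    return min(P0 * c_i * (1.0 + alpha * i) * (1.0 + beta * e_i), 1.0)

def run_chain(n: int, c: float, checkpoint_every: int | None = None,
              detect_prob: float = 0.8,
              rng: random.Random | None = None) -> int:
    """Simulate an n-step chain; return undetected errors at the end.

    If checkpoint_every is set, a visual checkpoint fires every K steps
    and detects-and-corrects each outstanding error independently with
    probability detect_prob (the domain-dependent effectiveness)."""
    rng = rng or random.Random()
    errors = 0
    for i in range(1, n + 1):
        if rng.random() < step_error_prob(i, c, errors):
            errors += 1
        if checkpoint_every and i % checkpoint_every == 0:
            errors = sum(rng.random() >= detect_prob for _ in range(errors))
    return errors
```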
| Strategy | Description | Compute |
|---|---|---|
| Text-Only CoT | Sequential derivation with no checkpoints | n steps |
| Visual Checkpoint | Checkpoints every K steps; detect + correct errors | n + 3(n/K) steps |
| Best-of-N | N independent text-only chains, oracle selection | N · n steps (compute-matched) |
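Reusing run_chain from the error-model sketch above, the hypothetical harness below estimates final-answer accuracy (zero undetected errors) for each strategy. The compute-matched chain count N = 1 + 3/K follows from the step counts in the table; the oracle selector simply accepts any error-free chain.

```python
import random

def compare(n: int = 24, K: int = 3, c: float = 1.0, detect_prob: float = 0.8,
            trials: int = 20_000, seed: int = 0) -> dict[str, float]:
    """Monte Carlo accuracy of the three strategies at matched compute.
    Relies on run_chain from the error-model sketch above."""
    rng = random.Random(seed)
    N = round(1 + 3 / K)  # N * n steps matches the n + 3(n/K) visual budget
    wins = {"text_only": 0, "visual_checkpoint": 0, "best_of_n": 0}
    for _ in range(trials):
        wins["text_only"] += run_chain(n, c, rng=rng) == 0
        wins["visual_checkpoint"] += run_chain(
            n, c, checkpoint_every=K, detect_prob=detect_prob, rng=rng) == 0
        # Oracle selection: Best-of-N succeeds if any chain is error-free.
        wins["best_of_n"] += any(run_chain(n, c, rng=rng) == 0 for _ in range(N))
    return {k: v / trials for k, v in wins.items()}

# Sweeping detect_prob across domains shows where visual checkpoints
# beat compute-matched Best-of-N and where they do not.
print(compare(detect_prob=0.9))
print(compare(detect_prob=0.3))
```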
| Domain | Checkpoint Effectiveness | Text-Only Acc. | Visual Acc. | Best-of-N Acc. | Lift vs. Best-of-N | Breaks Through? |
|---|---|---|---|---|---|---|
Our simulation framework yields a domain-dependent answer to whether multimodal interleaved CoT can break through mathematical performance limits.