Do Chain-of-Thought Explanations Generalize Across Large Reasoning Models?

Interactive exploration of CoT transfer experiments across 5 LRMs and 6 reasoning domains (9,600 pairwise transfers)

Mean CGS: 1.1156
Accuracy Lift: 9.27%
Answer Agreement: 85.44%
Helpful Rate: 17.43%
Harmful Rate: 8.16%
Transfer Trials: 9,600
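
The headline rates are consistent with one another in a simple way: the accuracy lift matches the helpful rate minus the harmful rate. The short check below reproduces that arithmetic. It assumes, since the page does not define the terms, that a "helpful" transfer turns a wrong baseline answer into a correct one and a "harmful" transfer does the reverse; the variable names are illustrative only.

```python
# Headline rates as reported above, in percent.
helpful_rate = 17.43
harmful_rate = 8.16
reported_lift = 9.27

# Assumption (not stated in the source): lift is the net of helpful
# and harmful transfers over the same set of trials.
net_lift = round(helpful_rate - harmful_rate, 2)

print(f"net lift = {net_lift:.2f}%, reported lift = {reported_lift:.2f}%")
```

Under that reading, roughly two transfers help for every one that hurts.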

Charts (interactive, not reproduced in this text): CoT Generalization Score by Source Model; Domain-Stratified Transfer Rates; Ensemble CoT vs. Best Single-Source Transfer; Accuracy: With CoT vs. Baseline; Pairwise CoT Transfer Lift Matrix; Cross-Model Answer Agreement by Source Model and Domain

Statistical Validation

Test            Hypothesis                          Statistic      p-value   Result
Paired t-test   CoT transfer improves accuracy      t = 18.2673    2.61e-73  Significant
Kruskal-Wallis  Domain affects transfer success     H = 15.766     0.0075    Significant
Mann-Whitney U  Same-family transfers are stronger  U = 4273174.5  0.021     Significant
Chi-squared     Agreement exceeds chance (50%)      χ² = 4822.335  < 0.001   Significant
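
The chi-squared row can be checked by hand from the headline numbers. The sketch below runs a one-degree-of-freedom goodness-of-fit test of agreement against a 50% chance level. It assumes the agreeing-trial count can be recovered by rounding 85.44% of 9,600 to the nearest integer; the page reports only the rounded percentage, so that count is an inference.

```python
total_trials = 9600
agreement_rate = 0.8544  # reported value, already rounded

# Assumption: recover the integer count of agreeing trials by rounding.
agree = round(agreement_rate * total_trials)
disagree = total_trials - agree

# Under the null hypothesis, agreement and disagreement are equally likely.
expected = total_trials * 0.5

# Pearson chi-squared statistic over the two categories.
chi2 = ((agree - expected) ** 2 + (disagree - expected) ** 2) / expected

print(f"agree = {agree}, disagree = {disagree}, chi2 = {chi2:.3f}")
```

With these inputs the statistic comes out at 4822.335, the value listed in the table, which suggests the test was run on all 9,600 trials as a single pooled sample.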

CoT Generalization Scores (Detail)

Source Model               CGS     Acc w/ CoT  Baseline  Lift    N Transfers
OpenAI-o3-mini             1.1309  0.8865      0.7839    0.1026  1920
DeepSeek-R1                1.1172  0.9083      0.8130    0.0953  1920
QwQ-32B-Preview            1.1171  0.9042      0.8094    0.0948  1920
Claude-3.5-Sonnet          1.1123  0.8870      0.7974    0.0896  1920
Gemini-2.0-Flash-Thinking  1.1006  0.8891      0.8078    0.0812  1920
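
The page does not define CGS, but the reported values are consistent with a simple ratio of accuracy with CoT to baseline accuracy, with lift being the corresponding difference. The sketch below tests that reading against the table; the ratio definition is an assumption inferred from the numbers, not something the source states.

```python
# (source model, reported CGS, accuracy with CoT, baseline accuracy),
# copied from the table above.
rows = [
    ("OpenAI-o3-mini", 1.1309, 0.8865, 0.7839),
    ("DeepSeek-R1", 1.1172, 0.9083, 0.8130),
    ("QwQ-32B-Preview", 1.1171, 0.9042, 0.8094),
    ("Claude-3.5-Sonnet", 1.1123, 0.8870, 0.7974),
    ("Gemini-2.0-Flash-Thinking", 1.1006, 0.8891, 0.8078),
]

for name, reported_cgs, acc_with_cot, baseline in rows:
    # Assumed definitions: CGS as a ratio, lift as a difference.
    ratio = acc_with_cot / baseline
    lift = acc_with_cot - baseline
    print(f"{name:26s} ratio={ratio:.4f} reported={reported_cgs:.4f} lift={lift:.4f}")

# Unweighted mean of the reported per-source scores.
mean_cgs = sum(r[1] for r in rows) / len(rows)
print(f"mean CGS = {mean_cgs:.4f}")
```

The recomputed ratios agree with the reported scores to within rounding of the four-decimal inputs, and the unweighted mean of the five reported scores matches the headline Mean CGS of 1.1156.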

Ensemble CoT Comparison

Target Model               Ensemble Acc  Best Single Acc  Best Source        Advantage
Claude-3.5-Sonnet          0.9812        0.9417           OpenAI-o3-mini     +0.0395
OpenAI-o3-mini             0.9667        0.9271           Claude-3.5-Sonnet  +0.0396
Gemini-2.0-Flash-Thinking  0.9646        0.9062           Claude-3.5-Sonnet  +0.0584
DeepSeek-R1                0.9563        0.8896           Claude-3.5-Sonnet  +0.0667
QwQ-32B-Preview            0.9271        0.8792           DeepSeek-R1        +0.0479
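
The Advantage column is the ensemble accuracy minus the best single-source accuracy for each target. The sketch below recomputes it from the table and adds an unweighted mean across the five targets; that mean is a derived summary, not a figure reported on the page.

```python
# (target model, ensemble accuracy, best single-source accuracy),
# copied from the table above.
rows = [
    ("Claude-3.5-Sonnet", 0.9812, 0.9417),
    ("OpenAI-o3-mini", 0.9667, 0.9271),
    ("Gemini-2.0-Flash-Thinking", 0.9646, 0.9062),
    ("DeepSeek-R1", 0.9563, 0.8896),
    ("QwQ-32B-Preview", 0.9271, 0.8792),
]

# Advantage of the ensemble over the best single source, per target.
advantages = {name: round(ens - best, 4) for name, ens, best in rows}

# Unweighted mean across targets (derived here, not reported in the source).
mean_advantage = sum(advantages.values()) / len(advantages)

for name, adv in advantages.items():
    print(f"{name:26s} {adv:+.4f}")
print(f"mean advantage = {mean_advantage:+.4f}")
```

Every target gains from the ensemble, with advantages ranging from about 4 to about 7 accuracy points.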