Interactive exploration of chain-of-thought (CoT) transfer experiments across 5 large reasoning models (LRMs) and 6 reasoning domains (9,600 pairwise transfers)

| Test | Hypothesis | Statistic | p-value | Result |
|---|---|---|---|---|
| Paired t-test | CoT transfer improves accuracy | t = 18.2673 | 2.61e-73 | Significant |
| Kruskal-Wallis | Domain affects transfer success | H = 15.766 | 0.0075 | Significant |
| Mann-Whitney U | Same-family transfers are stronger | U = 4273174.5 | 0.021 | Significant |
| Chi-squared | Agreement exceeds chance (50%) | χ² = 4822.335 | < 0.001 | Significant |
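
As a quick sanity check on the Kruskal-Wallis row, the reported p-value can be reproduced from H alone. The sketch below assumes the six reasoning domains are the groups (so 5 degrees of freedom) and that the p-value comes from the usual chi-squared approximation; neither is stated in the table, and the helper name `chi2_sf_df5` is hypothetical. It uses the closed-form survival function for 5 degrees of freedom, so only the standard library is needed.

```python
import math


def chi2_sf_df5(x: float) -> float:
    """Upper-tail probability P(X > x) for a chi-squared variable with 5 degrees of freedom."""
    if x <= 0:
        return 1.0
    # Closed form for 5 degrees of freedom:
    # Q(x) = erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * (x**0.5 + x**1.5 / 3)
    tail = math.erfc(math.sqrt(x / 2.0))
    series = math.sqrt(x) + x ** 1.5 / 3.0
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * series


H = 15.766                 # Kruskal-Wallis statistic from the table
p_value = chi2_sf_df5(H)   # assumption: df = 6 domains - 1 = 5
print(f"p = {p_value:.4f}")
```

Under these assumptions the result lands close to the reported 0.0075. If the original analysis used a tie correction or a different grouping, a small discrepancy would be expected.
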

| Source Model | CGS | Accuracy with CoT | Baseline Accuracy | Lift | Transfers (N) |
|---|---|---|---|---|---|
| OpenAI-o3-mini | 1.1309 | 0.8865 | 0.7839 | 0.1026 | 1920 |
| DeepSeek-R1 | 1.1172 | 0.9083 | 0.8130 | 0.0953 | 1920 |
| QwQ-32B-Preview | 1.1171 | 0.9042 | 0.8094 | 0.0948 | 1920 |
| Claude-3.5-Sonnet | 1.1123 | 0.8870 | 0.7974 | 0.0896 | 1920 |
| Gemini-2.0-Flash-Thinking | 1.1006 | 0.8891 | 0.8078 | 0.0812 | 1920 |
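
The two derived columns in the source-model table can be cross-checked against the accuracy columns. The sketch below assumes Lift is accuracy with CoT minus baseline accuracy, and that CGS is the ratio of the two. The ratio reading is inferred from the numbers rather than stated in the table, and the variable names are illustrative. Because the tabulated values are rounded to four decimals, recomputed values can differ in the last digit, so the check allows a small tolerance.

```python
# (source model, CGS, accuracy with CoT, baseline accuracy, lift), copied from the table
rows = [
    ("OpenAI-o3-mini", 1.1309, 0.8865, 0.7839, 0.1026),
    ("DeepSeek-R1", 1.1172, 0.9083, 0.8130, 0.0953),
    ("QwQ-32B-Preview", 1.1171, 0.9042, 0.8094, 0.0948),
    ("Claude-3.5-Sonnet", 1.1123, 0.8870, 0.7974, 0.0896),
    ("Gemini-2.0-Flash-Thinking", 1.1006, 0.8891, 0.8078, 0.0812),
]

TOL = 2e-4  # slack for values that were rounded before being tabulated

checks = []
for model, cgs, acc_cot, baseline, lift in rows:
    lift_check = acc_cot - baseline    # assumed definition of Lift
    ratio_check = acc_cot / baseline   # assumed definition of CGS (inferred)
    ok = abs(lift_check - lift) <= TOL and abs(ratio_check - cgs) <= TOL
    checks.append(ok)
    print(f"{model:<26} lift={lift_check:.4f} ratio={ratio_check:.4f} consistent={ok}")
```
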

| Target Model | Ensemble Accuracy | Best Single-Source Accuracy | Best Source | Advantage |
|---|---|---|---|---|
| Claude-3.5-Sonnet | 0.9812 | 0.9417 | OpenAI-o3-mini | +0.0395 |
| OpenAI-o3-mini | 0.9667 | 0.9271 | Claude-3.5-Sonnet | +0.0396 |
| Gemini-2.0-Flash-Thinking | 0.9646 | 0.9062 | Claude-3.5-Sonnet | +0.0584 |
| DeepSeek-R1 | 0.9563 | 0.8896 | Claude-3.5-Sonnet | +0.0667 |
| QwQ-32B-Preview | 0.9271 | 0.8792 | DeepSeek-R1 | +0.0479 |
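
The Advantage column appears to be the gap between the ensemble accuracy and the best single-source accuracy for each target. A minimal sketch that recomputes it from the table follows; the tuple layout and variable names are illustrative, not from the source.

```python
# (target model, ensemble accuracy, best single-source accuracy, best source), from the table
rows = [
    ("Claude-3.5-Sonnet", 0.9812, 0.9417, "OpenAI-o3-mini"),
    ("OpenAI-o3-mini", 0.9667, 0.9271, "Claude-3.5-Sonnet"),
    ("Gemini-2.0-Flash-Thinking", 0.9646, 0.9062, "Claude-3.5-Sonnet"),
    ("DeepSeek-R1", 0.9563, 0.8896, "Claude-3.5-Sonnet"),
    ("QwQ-32B-Preview", 0.9271, 0.8792, "DeepSeek-R1"),
]

# Assumed definition: advantage = ensemble accuracy - best single-source accuracy
advantages = {target: round(ens - best, 4) for target, ens, best, _ in rows}

for target, adv in advantages.items():
    print(f"{target:<26} {adv:+.4f}")

mean_advantage = sum(advantages.values()) / len(advantages)
print(f"mean advantage: {mean_advantage:+.4f}")
```

Averaging the five rows this way gives an advantage of roughly five accuracy points for the ensemble over the best single source.
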