A causal attribution framework separating exploitation gains (bidirectional context) from novelty gains (new reasoning strategies) across four domains.
Diffusion language models (dLLMs) enable arbitrary-order token generation, relaxing the strict left-to-right constraint of autoregressive (AR) models. But do performance gains arise from better exploitation of existing patterns (via bidirectional context) or from genuinely new reasoning strategies unattainable under AR decoding?
The framework compares three decoding regimes (sketched in code below):

- **AR**: standard left-to-right generation; tokens condition only on forward (past) context.
- **Constrained**: a fixed non-left-to-right permutation (all even positions, then all odd); provides partial bidirectional context but no adaptive ordering.
- **Diffusion**: iterative denoising with adaptive token ordering; full bidirectional context plus adaptive reordering.
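A minimal sketch of the three generation orders over a length-n sequence; the even-then-odd permutation follows the Constrained definition above, while the confidence-based ordering is only an illustrative stand-in for the diffusion sampler's actual scheduling:

```python
import numpy as np

def ar_order(n: int) -> list[int]:
    """AR: position i is generated at step i (strict left-to-right)."""
    return list(range(n))

def constrained_order(n: int) -> list[int]:
    """Constrained: fixed non-left-to-right permutation (even, then odd)."""
    return list(range(0, n, 2)) + list(range(1, n, 2))

def adaptive_order(confidence: np.ndarray) -> list[int]:
    """Diffusion (illustrative): commit the most confident masked positions
    first; a real sampler would re-score remaining positions between steps."""
    return np.argsort(-confidence).tolist()

print(ar_order(6))                                # [0, 1, 2, 3, 4, 5]
print(constrained_order(6))                       # [0, 2, 4, 1, 3, 5]
print(adaptive_order(np.array([0.2, 0.9, 0.5])))  # [1, 2, 0]
```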
The gains are defined as:

- Exploitation Gain = Constrained - AR
- Novelty Gain = Diffusion - Constrained
- Total Gain = Diffusion - AR = Exploitation + Novelty
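The decomposition itself is plain arithmetic over the three accuracies; a minimal sketch, using the Math figures at mask ratio 0.5 from the tables below:

```python
def decompose_gain(ar: float, constrained: float, diffusion: float) -> dict:
    """Split the total diffusion-over-AR gain into exploitation and novelty."""
    exploitation = constrained - ar        # bidirectional context alone
    novelty = diffusion - constrained      # adaptive ordering on top of that
    total = diffusion - ar                 # identity: exploitation + novelty
    return {
        "total": round(total, 4),
        "exploitation": round(exploitation, 4),
        "novelty": round(novelty, 4),
        "exploit_pct": round(100 * exploitation / total, 1) if total else None,
    }

print(decompose_gain(ar=0.5923, constrained=0.6879, diffusion=0.6990))
# {'total': 0.1067, 'exploitation': 0.0956, 'novelty': 0.0111, 'exploit_pct': 89.6}
```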
Headline decomposition (mask ratio 0.5):

| Domain | Diffusion Acc | AR Acc | Total Gain | Exploitation | Novelty | Exploit % |
|---|---|---|---|---|---|---|
| Math | 0.6990 | 0.5923 | 0.1067 | 0.0956 | 0.0111 | 89.6% |
| Code | 0.7512 | 0.7030 | 0.0482 | 0.0366 | 0.0116 | 75.9% |
| Logic | 0.7341 | 0.6612 | 0.0729 | 0.0788 | -0.0058 | 108.0% |
| Structured | 0.7266 | 0.5571 | 0.1695 | 0.0813 | 0.0882 | 48.0% |
Full decomposition across mask ratios:

| Domain | Mask Ratio | Diff Acc | AR Acc | Constrained Acc | Total Gain | Exploit Gain | Novelty Gain | Exploit % |
|---|---|---|---|---|---|---|---|---|
| Math | 0.3 | 0.8530 | 0.7939 | 0.8370 | 0.0590 | 0.0431 | 0.0160 | 72.9% |
| Math | 0.5 | 0.6990 | 0.5923 | 0.6879 | 0.1067 | 0.0956 | 0.0111 | 89.6% |
| Math | 0.7 | 0.6052 | 0.5486 | 0.5359 | 0.0565 | -0.0127 | 0.0692 | -22.5% |
| Code | 0.3 | 0.8481 | 0.8165 | 0.8741 | 0.0316 | 0.0575 | -0.0259 | 182.0% |
| Code | 0.5 | 0.7512 | 0.7030 | 0.7396 | 0.0482 | 0.0366 | 0.0116 | 75.9% |
| Code | 0.7 | 0.6200 | 0.5956 | 0.6234 | 0.0244 | 0.0278 | -0.0034 | 113.9% |
| Logic | 0.3 | 0.8143 | 0.7821 | 0.8313 | 0.0321 | 0.0492 | -0.0171 | 153.1% |
| Logic | 0.5 | 0.7341 | 0.6612 | 0.7400 | 0.0729 | 0.0788 | -0.0058 | 108.0% |
| Logic | 0.7 | 0.6559 | 0.5178 | 0.5816 | 0.1382 | 0.0639 | 0.0743 | 46.2% |
| Structured | 0.3 | 0.8401 | 0.7139 | 0.7822 | 0.1262 | 0.0683 | 0.0578 | 54.2% |
| Structured | 0.5 | 0.7266 | 0.5571 | 0.6384 | 0.1695 | 0.0813 | 0.0882 | 48.0% |
| Structured | 0.7 | 0.6075 | 0.4491 | 0.5090 | 0.1584 | 0.0599 | 0.0985 | 37.8% |
Order sensitivity by domain:

| Domain | Mean Ratio | Std | Forward Dep | Backward Dep |
|---|---|---|---|---|
| Code | 0.9768 | 0.0541 | 0.0417 | 0.0407 |
| Math | 0.9669 | 0.0481 | 0.0364 | 0.0349 |
| Logic | 0.8672 | 0.1508 | 0.0333 | 0.0299 |
| Structured | 0.8496 | 0.1890 | 0.0278 | 0.0232 |
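The tables do not spell out how the dependencies are measured. One plausible probe, sketched below purely as an assumption, scores each target token with full context, left-only context, and right-only context, treats the log-probability drops as forward/backward dependencies, and averages a per-example min/max symmetry ratio (1.0 = perfectly symmetric). All names here are hypothetical:

```python
import numpy as np

def context_dependencies(logp_full: float, logp_left_only: float,
                         logp_right_only: float) -> tuple[float, float]:
    """Hypothetical probe for one target token.
    forward_dep:  log-prob drop when the left (past) context is hidden.
    backward_dep: log-prob drop when the right (future) context is hidden."""
    forward_dep = logp_full - logp_right_only   # left context removed
    backward_dep = logp_full - logp_left_only   # right context removed
    return forward_dep, backward_dep

def symmetry_ratio(fwd: np.ndarray, bwd: np.ndarray) -> float:
    """Mean per-example min/max dependency ratio; 1.0 means the model
    leans on both directions equally."""
    lo, hi = np.minimum(fwd, bwd), np.maximum(fwd, bwd)
    return float(np.mean(lo / hi))
```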
Pattern coverage by domain:

| Domain | AR Coverage | Diff Coverage | Ratio |
|---|---|---|---|
| Code | 20.00 | 713,955.78 | 32,570.44 |
| Math | 15.38 | 106,911.05 | 5,383.75 |
| Logic | 15.88 | 78,692.79 | 4,178.98 |
| Structured | 12.75 | 18,573.68 | 1,090.81 |
Best-of-k oracle accuracy at k = 8:

| Domain | Diff Oracle | AR Oracle | Gap | Diff Diversity |
|---|---|---|---|---|
| Math | 0.7908 | 0.6973 | +0.0935 | 0.0685 |
| Code | 0.7987 | 0.7637 | +0.0349 | 0.0567 |
| Logic | 0.7811 | 0.7099 | +0.0711 | 0.0601 |
| Structured | 0.7724 | 0.6732 | +0.0992 | 0.0708 |
Oracle gap (Diff - AR) as a function of k:

| Domain | k=2 | k=4 | k=8 | k=16 |
|---|---|---|---|---|
| Math | 0.1127 | 0.0977 | 0.0935 | 0.0657 |
| Code | 0.0542 | 0.0645 | 0.0349 | 0.0290 |
| Logic | 0.0729 | 0.0579 | 0.0711 | 0.0394 |
| Structured | 0.1695 | 0.1313 | 0.0992 | 0.0541 |
For math (89.6%), code (75.9%), and logic (108.0%), most of the gain from arbitrary-order decoding comes from better utilization of existing solution patterns through bidirectional context, not from novel reasoning strategies (logic exceeds 100% because its novelty gain is slightly negative).
Structured text shows only 48.0% exploitation, with a novelty gain of 0.0882 comparable to the exploitation gain of 0.0813. Rigid syntactic constraints (JSON, SQL, HTML) create genuine opportunities for non-sequential strategies.
At low masking (0.3), exploitation dominates everywhere. At high masking (0.7), novelty gains become more prominent, especially for math (exploitation fraction drops to -22.5%) and structured text (37.8%).
Best-of-k oracle analysis at k=8 shows diffusion advantages ranging from +0.0349 (code) to +0.0992 (structured text) across all domains, indicating greater solution diversity.
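A minimal sketch of the best-of-k oracle, assuming a task-specific correctness checker and per-model samplers (both placeholders, not the actual evaluation harness):

```python
def best_of_k_oracle(problems, sample_fn, is_correct, k: int = 8) -> float:
    """Oracle accuracy: a problem counts as solved if ANY of k independent
    samples is correct. sample_fn(problem) draws one completion; is_correct
    is a task-specific checker. Both are hypothetical placeholders."""
    solved = sum(
        any(is_correct(p, sample_fn(p)) for _ in range(k)) for p in problems
    )
    return solved / len(problems)

# gap = best_of_k_oracle(problems, diffusion_sample, is_correct, k=8) \
#     - best_of_k_oracle(problems, ar_sample, is_correct, k=8)
```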
Correlations across the four domains:

| Metric Pair | Pearson r |
|---|---|
| Order Sensitivity vs. Total Gain | -0.586 |
| Exploitation Fraction vs. Coverage Ratio | -0.014 |
Domains with less symmetric context dependencies (lower mean order-sensitivity ratio) tend to show larger total gains from diffusion decoding (r = -0.586). The exploitation fraction, however, is nearly uncorrelated with the pattern coverage ratio (r = -0.014).
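Both coefficients can be reproduced directly from the per-domain values in the tables above (mean order-sensitivity ratio and total gain at mask ratio 0.5; exploitation fraction and coverage ratio):

```python
import numpy as np

# Per-domain values (math, code, logic, structured) from the tables above:
mean_ratio     = np.array([0.9669, 0.9768, 0.8672, 0.8496])   # order sensitivity
total_gain     = np.array([0.1067, 0.0482, 0.0729, 0.1695])   # at mask ratio 0.5
exploit_frac   = np.array([0.896, 0.759, 1.080, 0.480])
coverage_ratio = np.array([5383.75, 32570.44, 4178.98, 1090.81])

print(np.corrcoef(mean_ratio, total_gain)[0, 1])        # ~ -0.586
print(np.corrcoef(exploit_frac, coverage_ratio)[0, 1])  # ~ -0.014
```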