Exploitation or Innovation? Decomposing the Source of Gains from Arbitrary-Order Decoding in Diffusion Language Models

A causal attribution framework separating exploitation gains (bidirectional context) from novelty gains (new reasoning strategies) across four domains.

- Domains evaluated: 4
- Problem instances: 32
- Exploitation fraction (3 of 4 domains): 75.9–89.6%
- Exploitation fraction in structured text: 48.0%

Problem & Methods

Research Question

Diffusion language models (dLLMs) enable arbitrary-order token generation, relaxing the strict left-to-right constraint of autoregressive (AR) models. But do performance gains arise from better exploitation of existing patterns (via bidirectional context) or from genuinely new reasoning strategies unattainable under AR decoding?

Three-Level Decoding Ablation

1. AR Decoding: Standard left-to-right generation; each token conditions only on the preceding (already generated) context.

2. Constrained Non-Sequential: Fixed non-left-to-right permutation (even-indexed positions, then odd). Provides partial bidirectional context but no adaptive ordering.

3. Adaptive Diffusion: Iterative denoising with adaptive ordering. Full bidirectional context and adaptive reordering.
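As a sketch, the fixed permutation used in level 2 can be generated as follows (treating the masked slots as a 0-indexed list is an illustrative assumption; the source specifies only the even-then-odd order):

```python
def even_then_odd_order(n_masked):
    """Fixed non-left-to-right schedule: fill even-indexed masked slots
    first, then odd-indexed ones."""
    positions = list(range(n_masked))
    return positions[0::2] + positions[1::2]

print(even_then_odd_order(7))  # [0, 2, 4, 6, 1, 3, 5]
```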

Exploitation Gain = Acc(Constrained) − Acc(AR)  |  Novelty Gain = Acc(Diffusion) − Acc(Constrained)  |  Total Gain = Acc(Diffusion) − Acc(AR)
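In code, the decomposition is direct; this minimal sketch (not the paper's evaluation code) reproduces the math row of the 50% masking table:

```python
def decompose_gains(acc_ar, acc_constrained, acc_diffusion):
    """Split the diffusion-vs-AR gain into exploitation and novelty parts."""
    exploitation = acc_constrained - acc_ar    # from bidirectional context alone
    novelty = acc_diffusion - acc_constrained  # from adaptive ordering on top
    total = acc_diffusion - acc_ar             # == exploitation + novelty
    return {"total": total, "exploitation": exploitation,
            "novelty": novelty, "exploit_frac": exploitation / total}

# Math at 50% masking (accuracies from the tables below):
g = decompose_gains(acc_ar=0.5923, acc_constrained=0.6879, acc_diffusion=0.6990)
print(round(g["exploitation"], 4), round(g["novelty"], 4),
      round(g["exploit_frac"], 3))  # 0.0956 0.0111 0.896
```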

Causal Attribution at 50% Masking

Exploitation vs. Novelty Gains

Exploitation Fraction by Domain

| Domain | Diffusion Acc | AR Acc | Total Gain | Exploitation | Novelty | Exploit % |
|---|---|---|---|---|---|---|
| Math | 0.6990 | 0.5923 | 0.1067 | 0.0956 | 0.0111 | 89.6% |
| Code | 0.7512 | 0.7030 | 0.0482 | 0.0366 | 0.0116 | 75.9% |
| Logic | 0.7341 | 0.6612 | 0.0729 | 0.0788 | -0.0058 | 108.0% |
| Structured | 0.7266 | 0.5571 | 0.1695 | 0.0813 | 0.0882 | 48.0% |

Exploitation Fraction Across Mask Levels

| Domain | Mask | Diff Acc | AR Acc | Constrained Acc | Total Gain | Exploit Gain | Novelty Gain | Exploit % |
|---|---|---|---|---|---|---|---|---|
| Math | 0.3 | 0.8530 | 0.7939 | 0.8370 | 0.0590 | 0.0431 | 0.0160 | 72.9% |
| Math | 0.5 | 0.6990 | 0.5923 | 0.6879 | 0.1067 | 0.0956 | 0.0111 | 89.6% |
| Math | 0.7 | 0.6052 | 0.5486 | 0.5359 | 0.0565 | -0.0127 | 0.0692 | -22.5% |
| Code | 0.3 | 0.8481 | 0.8165 | 0.8741 | 0.0316 | 0.0575 | -0.0259 | 182.0% |
| Code | 0.5 | 0.7512 | 0.7030 | 0.7396 | 0.0482 | 0.0366 | 0.0116 | 75.9% |
| Code | 0.7 | 0.6200 | 0.5956 | 0.6234 | 0.0244 | 0.0278 | -0.0034 | 113.9% |
| Logic | 0.3 | 0.8143 | 0.7821 | 0.8313 | 0.0321 | 0.0492 | -0.0171 | 153.1% |
| Logic | 0.5 | 0.7341 | 0.6612 | 0.7400 | 0.0729 | 0.0788 | -0.0058 | 108.0% |
| Logic | 0.7 | 0.6559 | 0.5178 | 0.5816 | 0.1382 | 0.0639 | 0.0743 | 46.2% |
| Structured | 0.3 | 0.8401 | 0.7139 | 0.7822 | 0.1262 | 0.0683 | 0.0578 | 54.2% |
| Structured | 0.5 | 0.7266 | 0.5571 | 0.6384 | 0.1695 | 0.0813 | 0.0882 | 48.0% |
| Structured | 0.7 | 0.6075 | 0.4491 | 0.5090 | 0.1584 | 0.0599 | 0.0985 | 37.8% |

Order Sensitivity Analysis

Order Sensitivity Ratio by Domain

Forward vs. Backward Dependencies

| Domain | Mean Ratio | Std | Forward Dep | Backward Dep |
|---|---|---|---|---|
| Code | 0.9768 | 0.0541 | 0.0417 | 0.0407 |
| Math | 0.9669 | 0.0481 | 0.0364 | 0.0349 |
| Logic | 0.8672 | 0.1508 | 0.0333 | 0.0299 |
| Structured | 0.8496 | 0.1890 | 0.0278 | 0.0232 |
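One plausible formalization of the ratio (an assumption; the source does not spell out the exact estimator) treats forward and backward dependency as mean accuracy drops when future vs. past context is withheld, and scores their symmetry:

```python
def order_sensitivity_ratio(forward_dep, backward_dep):
    """Symmetry of forward vs. backward dependencies: 1.0 means perfectly
    symmetric, lower values mean one direction matters more."""
    hi, lo = max(forward_dep, backward_dep), min(forward_dep, backward_dep)
    return lo / hi if hi > 0 else 1.0

# Applied to the domain-mean dependencies in the table above:
print(round(order_sensitivity_ratio(0.0417, 0.0407), 3))  # Code: 0.976
print(round(order_sensitivity_ratio(0.0278, 0.0232), 3))  # Structured: 0.835
```

These recomputed values differ slightly from the reported mean ratios (0.9768, 0.8496), consistent with averaging per-instance ratios rather than taking ratios of domain means.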

Pattern Coverage

Coverage Ratio (Diffusion / AR)

Coverage Details

| Domain | AR Coverage | Diff Coverage | Ratio |
|---|---|---|---|
| Code | 20.00 | 713,955.78 | 32,570.44 |
| Math | 15.38 | 106,911.05 | 5,383.75 |
| Logic | 15.88 | 78,692.79 | 4,178.98 |
| Structured | 12.75 | 18,573.68 | 1,090.81 |
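A detail worth flagging: the Ratio column is not the quotient of the two mean coverages in the same row (e.g. 713,955.78 / 20.00 ≈ 35,698, not 32,570.44), which is consistent with the ratio being averaged per instance. A synthetic illustration (made-up numbers, not the paper's data):

```python
# (ar_coverage, diff_coverage) for two hypothetical instances
instances = [(10.0, 50_000.0), (30.0, 60_000.0)]
mean_ar = sum(a for a, _ in instances) / len(instances)
mean_diff = sum(d for _, d in instances) / len(instances)
ratio_of_means = mean_diff / mean_ar
mean_of_ratios = sum(d / a for a, d in instances) / len(instances)
print(ratio_of_means)  # 2750.0
print(mean_of_ratios)  # 3500.0
```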

Oracle & Diversity Analysis

Best-of-k Oracle Accuracy (k=8)

Oracle Gap by Sample Size k

Best-of-k Oracle at k=8

| Domain | Diff Oracle | AR Oracle | Gap | Diff Diversity |
|---|---|---|---|---|
| Math | 0.7908 | 0.6973 | +0.0935 | 0.0685 |
| Code | 0.7987 | 0.7637 | +0.0349 | 0.0567 |
| Logic | 0.7811 | 0.7099 | +0.0711 | 0.0601 |
| Structured | 0.7724 | 0.6732 | +0.0992 | 0.0708 |
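The oracle metric itself is a simple any-correct rule; this sketch computes it from hypothetical per-problem sample gradings (sampling and grading are outside its scope):

```python
def oracle_accuracy(sample_correct):
    """Fraction of problems with at least one correct sample among k."""
    return sum(any(samples) for samples in sample_correct) / len(sample_correct)

# Hypothetical gradings for 4 problems, k=2 samples each:
outcomes = [[True, False], [False, False], [False, True], [True, True]]
print(oracle_accuracy(outcomes))  # 0.75
```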

Oracle Gap Across k Values

| Domain | k=2 | k=4 | k=8 | k=16 |
|---|---|---|---|---|
| Math | 0.1127 | 0.0977 | 0.0935 | 0.0657 |
| Code | 0.0542 | 0.0645 | 0.0349 | 0.0290 |
| Logic | 0.0729 | 0.0579 | 0.0711 | 0.0394 |
| Structured | 0.1695 | 0.1313 | 0.0992 | 0.0541 |

Key Findings

Finding 1: Exploitation dominates in standard reasoning domains.

For math (89.6%), code (75.9%), and logic (108.0%; above 100% because logic's novelty gain is slightly negative), the majority of the gain from arbitrary-order decoding comes from better utilization of existing solution patterns through bidirectional context, not from novel reasoning strategies.

Finding 2: Structured text is the exception.

Structured text shows only 48.0% exploitation, with a novelty gain of 0.0882 comparable to the exploitation gain of 0.0813. Rigid syntactic constraints (JSON, SQL, HTML) create genuine opportunities for non-sequential strategies.

Finding 3: Gains vary substantially across mask fractions.

At low masking (0.3), exploitation dominates everywhere. At high masking (0.7), novelty gains become more prominent, especially for math (the exploitation fraction turns negative, -22.5%, because constrained decoding falls below AR) and structured text (37.8%).

Finding 4: Diffusion consistently achieves higher oracle accuracy.

Best-of-k oracle analysis at k=8 shows diffusion advantages of +0.0349 (code) to +0.0992 (structured text) across all domains, indicating greater solution diversity.

Correlation Analysis

Order Sensitivity vs. Total Gain

Pearson r = -0.586 (lower symmetry correlates with higher gain)

Correlation Summary

| Metric Pair | Pearson r |
|---|---|
| Order Sensitivity vs. Total Gain | -0.586 |
| Exploitation Fraction vs. Coverage Ratio | -0.014 |

Domains with less symmetric dependencies (lower order sensitivity ratio) tend to show larger total gains from diffusion decoding. However, the exploitation fraction is nearly uncorrelated with pattern coverage ratio.
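Both coefficients can be reproduced from the domain-level values at 50% masking in the tables above (mean order-sensitivity ratios, total gains, exploitation fractions, and coverage ratios), with a plain Pearson implementation:

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Order: Code, Math, Logic, Structured (values from the tables above)
order_sensitivity = [0.9768, 0.9669, 0.8672, 0.8496]
total_gain        = [0.0482, 0.1067, 0.0729, 0.1695]   # 50% masking
exploit_fraction  = [0.759, 0.896, 1.080, 0.480]
coverage_ratio    = [32570.44, 5383.75, 4178.98, 1090.81]

r1 = pearson_r(order_sensitivity, total_gain)
r2 = pearson_r(exploit_fraction, coverage_ratio)
print(round(r1, 3), round(r2, 3))  # -0.586 -0.014
```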