A causal attribution framework separating exploitation gains (bidirectional context) from novelty gains (new reasoning strategies) across four domains.
Diffusion language models (dLLMs) enable arbitrary-order token generation, relaxing the strict left-to-right constraint of autoregressive (AR) models. But do performance gains arise from better exploitation of existing patterns (via bidirectional context) or from genuinely new reasoning strategies unattainable under AR decoding?
The framework compares three decoding regimes (sketched in code below):

- **AR**: standard left-to-right generation; tokens condition only on forward (past) context.
- **Constrained**: a fixed non-left-to-right permutation (all even positions, then all odd); provides partial bidirectional context but no adaptive ordering.
- **Diffusion**: iterative denoising with adaptive token ordering; full bidirectional context plus adaptive reordering.
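A minimal sketch of the three generation orders over a length-n sequence; the even-then-odd permutation follows the Constrained definition above, while the confidence-based ordering is only an illustrative stand-in for the diffusion sampler's actual scheduling:

```python
import numpy as np

def ar_order(n: int) -> list[int]:
    """AR: position i is generated at step i (strict left-to-right)."""
    return list(range(n))

def constrained_order(n: int) -> list[int]:
    """Constrained: fixed non-left-to-right permutation (even, then odd)."""
    return list(range(0, n, 2)) + list(range(1, n, 2))

def adaptive_order(confidence: np.ndarray) -> list[int]:
    """Diffusion (illustrative): commit the most confident masked positions
    first; a real sampler would re-score remaining positions between steps."""
    return np.argsort(-confidence).tolist()

print(ar_order(6))                                # [0, 1, 2, 3, 4, 5]
print(constrained_order(6))                       # [0, 2, 4, 1, 3, 5]
print(adaptive_order(np.array([0.2, 0.9, 0.5])))  # [1, 2, 0]
```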
The gains are defined as:

- Exploitation Gain = Constrained - AR
- Novelty Gain = Diffusion - Constrained
- Total Gain = Diffusion - AR = Exploitation + Novelty
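The decomposition itself is plain arithmetic over the three accuracies; a minimal sketch, using the Math figures at mask ratio 0.5 from the tables below:

```python
def decompose_gain(ar: float, constrained: float, diffusion: float) -> dict:
    """Split the total diffusion-over-AR gain into exploitation and novelty."""
    exploitation = constrained - ar        # bidirectional context alone
    novelty = diffusion - constrained      # adaptive ordering on top of that
    total = diffusion - ar                 # identity: exploitation + novelty
    return {
        "total": round(total, 4),
        "exploitation": round(exploitation, 4),
        "novelty": round(novelty, 4),
        "exploit_pct": round(100 * exploitation / total, 1) if total else None,
    }

print(decompose_gain(ar=0.5923, constrained=0.6879, diffusion=0.6990))
# {'total': 0.1067, 'exploitation': 0.0956, 'novelty': 0.0111, 'exploit_pct': 89.6}
```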
Headline decomposition (mask ratio 0.5):

| Domain | Diffusion Acc | AR Acc | Total Gain | Exploitation | Novelty | Exploit % |
|---|---|---|---|---|---|---|
| Math | 0.6990 | 0.5923 | 0.1067 | 0.0956 | 0.0111 | 89.6% |
| Code | 0.7512 | 0.7030 | 0.0482 | 0.0366 | 0.0116 | 75.9% |
| Logic | 0.7341 | 0.6612 | 0.0729 | 0.0788 | -0.0058 | 108.0% |
| Structured | 0.7266 | 0.5571 | 0.1695 | 0.0813 | 0.0882 | 48.0% |
Full decomposition across mask ratios:

| Domain | Mask Ratio | Diff Acc | AR Acc | Constrained Acc | Total Gain | Exploit Gain | Novelty Gain | Exploit % |
|---|---|---|---|---|---|---|---|---|
| Math | 0.3 | 0.8530 | 0.7939 | 0.8370 | 0.0590 | 0.0431 | 0.0160 | 72.9% |
| Math | 0.5 | 0.6990 | 0.5923 | 0.6879 | 0.1067 | 0.0956 | 0.0111 | 89.6% |
| Math | 0.7 | 0.6052 | 0.5486 | 0.5359 | 0.0565 | -0.0127 | 0.0692 | -22.5% |
| Code | 0.3 | 0.8481 | 0.8165 | 0.8741 | 0.0316 | 0.0575 | -0.0259 | 182.0% |
| Code | 0.5 | 0.7512 | 0.7030 | 0.7396 | 0.0482 | 0.0366 | 0.0116 | 75.9% |
| Code | 0.7 | 0.6200 | 0.5956 | 0.6234 | 0.0244 | 0.0278 | -0.0034 | 113.9% |
| Logic | 0.3 | 0.8143 | 0.7821 | 0.8313 | 0.0321 | 0.0492 | -0.0171 | 153.1% |
| Logic | 0.5 | 0.7341 | 0.6612 | 0.7400 | 0.0729 | 0.0788 | -0.0058 | 108.0% |
| Logic | 0.7 | 0.6559 | 0.5178 | 0.5816 | 0.1382 | 0.0639 | 0.0743 | 46.2% |
| Structured | 0.3 | 0.8401 | 0.7139 | 0.7822 | 0.1262 | 0.0683 | 0.0578 | 54.2% |
| Structured | 0.5 | 0.7266 | 0.5571 | 0.6384 | 0.1695 | 0.0813 | 0.0882 | 48.0% |
| Structured | 0.7 | 0.6075 | 0.4491 | 0.5090 | 0.1584 | 0.0599 | 0.0985 | 37.8% |
Order sensitivity by domain:

| Domain | Mean Ratio | Std | Forward Dep | Backward Dep |
|---|---|---|---|---|
| Code | 0.9768 | 0.0541 | 0.0417 | 0.0407 |
| Math | 0.9669 | 0.0481 | 0.0364 | 0.0349 |
| Logic | 0.8672 | 0.1508 | 0.0333 | 0.0299 |
| Structured | 0.8496 | 0.1890 | 0.0278 | 0.0232 |
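The tables do not spell out how the dependencies are measured. One plausible probe, sketched below purely as an assumption, scores each target token with full context, left-only context, and right-only context, treats the log-probability drops as forward/backward dependencies, and averages a per-example min/max symmetry ratio (1.0 = perfectly symmetric). All names here are hypothetical:

```python
import numpy as np

def context_dependencies(logp_full: float, logp_left_only: float,
                         logp_right_only: float) -> tuple[float, float]:
    """Hypothetical probe for one target token.
    forward_dep:  log-prob drop when the left (past) context is hidden.
    backward_dep: log-prob drop when the right (future) context is hidden."""
    forward_dep = logp_full - logp_right_only   # left context removed
    backward_dep = logp_full - logp_left_only   # right context removed
    return forward_dep, backward_dep

def symmetry_ratio(fwd: np.ndarray, bwd: np.ndarray) -> float:
    """Mean per-example min/max dependency ratio; 1.0 means the model
    leans on both directions equally."""
    lo, hi = np.minimum(fwd, bwd), np.maximum(fwd, bwd)
    return float(np.mean(lo / hi))
```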
Pattern coverage by domain:

| Domain | AR Coverage | Diff Coverage | Ratio |
|---|---|---|---|
| Code | 20.00 | 713,955.78 | 32,570.44 |
| Math | 15.38 | 106,911.05 | 5,383.75 |
| Logic | 15.88 | 78,692.79 | 4,178.98 |
| Structured | 12.75 | 18,573.68 | 1,090.81 |
Best-of-k oracle accuracy at k = 8:

| Domain | Diff Oracle | AR Oracle | Gap | Diff Diversity |
|---|---|---|---|---|
| Math | 0.7908 | 0.6973 | +0.0935 | 0.0685 |
| Code | 0.7987 | 0.7637 | +0.0349 | 0.0567 |
| Logic | 0.7811 | 0.7099 | +0.0711 | 0.0601 |
| Structured | 0.7724 | 0.6732 | +0.0992 | 0.0708 |
Oracle gap (Diff - AR) as a function of k:

| Domain | k=2 | k=4 | k=8 | k=16 |
|---|---|---|---|---|
| Math | 0.1127 | 0.0977 | 0.0935 | 0.0657 |
| Code | 0.0542 | 0.0645 | 0.0349 | 0.0290 |
| Logic | 0.0729 | 0.0579 | 0.0711 | 0.0394 |
| Structured | 0.1695 | 0.1313 | 0.0992 | 0.0541 |
For math (89.6%), code (75.9%), and logic (108.0%), most of the gain from arbitrary-order decoding comes from better utilization of existing solution patterns through bidirectional context, not from novel reasoning strategies (logic exceeds 100% because its novelty gain is slightly negative).
Structured text shows only 48.0% exploitation, with a novelty gain of 0.0882 comparable to the exploitation gain of 0.0813. Rigid syntactic constraints (JSON, SQL, HTML) create genuine opportunities for non-sequential strategies.
At low masking (0.3), exploitation dominates everywhere. At high masking (0.7), novelty gains become more prominent, especially for math (exploitation fraction drops to -22.5%) and structured text (37.8%).
Best-of-k oracle analysis at k=8 shows diffusion advantages ranging from +0.0349 (code) to +0.0992 (structured text) across all domains, indicating greater solution diversity.
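A minimal sketch of the best-of-k oracle, assuming a task-specific correctness checker and per-model samplers (both placeholders, not the actual evaluation harness):

```python
def best_of_k_oracle(problems, sample_fn, is_correct, k: int = 8) -> float:
    """Oracle accuracy: a problem counts as solved if ANY of k independent
    samples is correct. sample_fn(problem) draws one completion; is_correct
    is a task-specific checker. Both are hypothetical placeholders."""
    solved = sum(
        any(is_correct(p, sample_fn(p)) for _ in range(k)) for p in problems
    )
    return solved / len(problems)

# gap = best_of_k_oracle(problems, diffusion_sample, is_correct, k=8) \
#     - best_of_k_oracle(problems, ar_sample, is_correct, k=8)
```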
Correlations across the four domains:

| Metric Pair | Pearson r |
|---|---|
| Order Sensitivity vs. Total Gain | -0.586 |
| Exploitation Fraction vs. Coverage Ratio | -0.014 |
Domains with less symmetric context dependencies (lower mean order-sensitivity ratio) tend to show larger total gains from diffusion decoding (r = -0.586). The exploitation fraction, however, is nearly uncorrelated with the pattern coverage ratio (r = -0.014).
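Both coefficients can be reproduced directly from the per-domain values in the tables above (mean order-sensitivity ratio and total gain at mask ratio 0.5; exploitation fraction and coverage ratio):

```python
import numpy as np

# Per-domain values (math, code, logic, structured) from the tables above:
mean_ratio     = np.array([0.9669, 0.9768, 0.8672, 0.8496])   # order sensitivity
total_gain     = np.array([0.1067, 0.0482, 0.0729, 0.1695])   # at mask ratio 0.5
exploit_frac   = np.array([0.896, 0.759, 1.080, 0.480])
coverage_ratio = np.array([5383.75, 32570.44, 4178.98, 1090.81])

print(np.corrcoef(mean_ratio, total_gain)[0, 1])        # ~ -0.586
print(np.corrcoef(exploit_frac, coverage_ratio)[0, 1])  # ~ -0.014
```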