Mitigating Rank-Aware Training-Inference Mismatch

Comparing scheduled sampling and consistency regularization for autoregressive ranking beyond the first token

Key Findings

+26.3%: AR quality gain, Consistency Reg. vs. Teacher Forcing baseline
0.0691: Best AR quality (Consistency Reg.)
0.0286: KL at t=1 (identical for all methods)
200: Monte Carlo simulations

Method	AR Quality	TF Quality	Mismatch	KL at t=1
Teacher Forcing	0.0547	0.0623	-0.0278	0.0286
Scheduled Sampling	0.0539	0.0609	-0.0285	0.0286
Consistency Reg.	0.0691	0.0620	-0.0287	0.0286
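
The headline gain can be checked against the AR Quality column. The sketch below assumes the baseline is the Teacher Forcing row and that the headline figure is the relative change in AR quality; the helper name relative_gain is hypothetical and used only for illustration.

```python
# AR quality values copied from the results table.
ar_quality = {
    "Teacher Forcing": 0.0547,
    "Scheduled Sampling": 0.0539,
    "Consistency Reg.": 0.0691,
}


def relative_gain(value, baseline):
    """Return the relative change of value over baseline, as a fraction."""
    return (value - baseline) / baseline


# Assumption: Teacher Forcing is the baseline the headline refers to.
baseline = ar_quality["Teacher Forcing"]
for method, quality in ar_quality.items():
    print(f"{method}: {relative_gain(quality, baseline) * 100:+.1f}%")
# The Consistency Reg. line prints +26.3%
```

Under the same assumption, Scheduled Sampling comes out slightly below the baseline on AR quality, which agrees with the table.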