Mitigating Rank-Aware Training-Inference Mismatch

Comparing scheduled sampling and consistency regularization for autoregressive ranking beyond the first token
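Scheduled sampling targets the training-inference mismatch by sometimes feeding the model its own previous prediction instead of the ground-truth token during training. A minimal sketch, assuming flat token sequences and a fixed mixing probability (the function name and signature are illustrative, not taken from the study):

```python
import random

def scheduled_sampling_inputs(gold_tokens, model_predictions, teacher_prob):
    """Mix gold and model-predicted tokens as decoder inputs.

    At each step, feed the ground-truth token with probability
    `teacher_prob`; otherwise feed the model's own prediction, exposing
    the model to its inference-time input distribution during training.
    """
    inputs = []
    for gold, pred in zip(gold_tokens, model_predictions):
        inputs.append(gold if random.random() < teacher_prob else pred)
    return inputs
```

With `teacher_prob=1.0` this reduces to pure teacher forcing; in practice the probability is annealed toward 0 over training.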

Key Findings

- +26.3% — autoregressive quality of consistency regularization vs. the teacher-forcing baseline
- 0.0691 — best autoregressive quality (consistency regularization)
- 0.0286 — KL divergence at position t=1 (identical across all methods)
- 200 — Monte Carlo simulations

[Figures: Autoregressive Quality; Training-Inference Mismatch; Position-Level KL Divergence]

Results Table

Method                AR Quality   TF Quality   Mismatch   KL at t=1
Teacher Forcing       0.0547       0.0623       -0.0278    0.0286
Scheduled Sampling    0.0539       0.0609       -0.0285    0.0286
Consistency Reg.      0.0691       0.0620       -0.0287    0.0286
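Consistency regularization instead keeps teacher-forced inputs but adds a penalty that pulls the free-running (autoregressive) output distributions toward the teacher-forced ones; a position-level KL divergence like the one tabulated above is a natural choice of penalty. A minimal stdlib-only sketch, assuming per-position probability distributions as nested lists (the function names and the additive loss weighting are assumptions, not the study's exact formulation):

```python
import math

def positionwise_kl(p, q, eps=1e-12):
    """Mean KL(p || q) over sequence positions.

    p, q: lists of per-position token distributions (lists of
    probabilities), e.g. from the teacher-forced and free-running passes.
    """
    total = 0.0
    for p_t, q_t in zip(p, q):
        total += sum(pi * math.log(max(pi, eps) / max(qi, eps))
                     for pi, qi in zip(p_t, q_t))
    return total / len(p)

def consistency_loss(ce_loss, p_teacher_forced, p_autoregressive, weight=0.5):
    """Cross-entropy plus a weighted KL penalty aligning the
    free-running distributions with the teacher-forced ones."""
    return ce_loss + weight * positionwise_kl(p_teacher_forced, p_autoregressive)
```

When the two passes agree exactly, the penalty vanishes and the loss reduces to plain cross-entropy; divergence between the passes is penalized in proportion to `weight`.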