Stabilizing LUFFY Training on Hard Problems

Addressing importance-ratio variance, pure-imitation traps, and entropy collapse when using human reference solutions with LUFFY.

Vanilla Max IS Ratio: 2.85
Stabilized Max IS Ratio: <1.01
Theoretical Max Entropy: 3.912
Divergences (Stabilized): 0
Seeds Tested: 5
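
The theoretical maximum entropy of 3.912 matches ln(50), the entropy in nats of a uniform distribution over 50 outcomes. The source does not state the vocabulary size or the log base, so both are assumptions in this minimal sketch; the names `entropy` and `VOCAB_SIZE` are illustrative only.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Assumption: 50 outcomes, inferred from ln(50) ~= 3.912; not stated in the source.
VOCAB_SIZE = 50

# A uniform distribution maximizes entropy for a fixed number of outcomes.
uniform = [1.0 / VOCAB_SIZE] * VOCAB_SIZE
max_entropy = entropy(uniform)

print(f"max entropy = {max_entropy:.3f} nats")
```

Under that assumption, a reported entropy of about 3.907 sits within roughly 0.005 nats of the maximum, so the policy is close to uniform.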

Figures (charts not reproduced): Max Importance-Sampling Ratio by Method; Final Entropy Across Seeds; Entropy Stability (Variance Across Seeds); Method Comparison: IS Ratio vs Entropy Preservation.

Seed Robustness Summary (5 Seeds)

Method                      | Mean Entropy | Entropy Std | Mean Max IS | Divergences
Vanilla LUFFY               | 3.9067       | 0.0002      | 2.938       | 0/5
Seq-Level IS + Adaptive Mix | 3.9071       | 0.0002      | 1.005       | 0/5
Bridged Traces              | 3.9071       | 0.0002      | 1.005       | 0/5
Prefix-Guided Hybrid        | 3.9071       | 0.0003      | 1.005       | 0/5

Three Pathologies Identified

Pathology           | Cause                 | Vanilla LUFFY      | Stabilized
IS Ratio Variance   | Distribution mismatch | Max ratio 2.85     | Max ratio <1.01
Pure-Imitation Trap | Zero on-policy reward | 0 reward all seeds | 0 reward (controlled)
Entropy Collapse    | No IS clipping        | Entropy ~3.907     | Entropy ~3.907
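
The IS Ratio Variance row contrasts a large maximum ratio with one near 1. One way a sequence-level ratio can damp extreme values is sketched below. It assumes the sequence-level ratio is the length-normalized geometric mean of the per-token ratios, which the source does not confirm. The function names and log-probabilities are hypothetical and chosen only to show the mechanics; they are not results from these experiments.

```python
import math


def token_is_ratios(logp_policy, logp_behavior):
    """Per-token importance ratios pi(a_t) / mu(a_t), computed from log-probs."""
    return [math.exp(p - q) for p, q in zip(logp_policy, logp_behavior)]


def sequence_is_ratio(logp_policy, logp_behavior):
    """Assumed sequence-level ratio: geometric mean of the per-token ratios."""
    diffs = [p - q for p, q in zip(logp_policy, logp_behavior)]
    return math.exp(sum(diffs) / len(diffs))


# Hypothetical log-probabilities for a three-token off-policy trace.
logp_policy = [-1.0, -2.0, -0.5]    # current policy
logp_behavior = [-1.2, -1.9, -1.5]  # distribution that produced the trace

tok = token_is_ratios(logp_policy, logp_behavior)
seq = sequence_is_ratio(logp_policy, logp_behavior)

print("max token-level ratio:", round(max(tok), 3))
print("sequence-level ratio: ", round(seq, 3))
```

Because a geometric mean never exceeds its largest term, a single badly mismatched token cannot dominate the sequence-level ratio the way it dominates the token-level maximum.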