Stabilizing LUFFY Training on Hard Problems

Addressing importance-ratio variance, pure-imitation traps, and entropy collapse when using human reference solutions with LUFFY.

Vanilla Max IS Ratio: 2.85
Stabilized Max IS Ratio: <1.01
Theoretical Max Entropy: 3.912
Divergences (Stabilized): 0
Seeds Tested: 5
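The theoretical maximum entropy of 3.912 can be sanity-checked with a short calculation. The sketch below assumes entropy is measured in nats over a categorical distribution, and that the support has 50 outcomes; the value 50 is an assumption chosen only because ln(50) ≈ 3.912, not something the summary states.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy(num_outcomes):
    """Entropy of the uniform distribution, the maximum for that support size."""
    return math.log(num_outcomes)


# Assumption: a 50-way categorical, picked because ln(50) matches 3.912.
VOCAB_SIZE = 50
uniform = [1.0 / VOCAB_SIZE] * VOCAB_SIZE

print(round(max_entropy(VOCAB_SIZE), 3))
print(round(categorical_entropy(uniform), 3))
```

Under that assumption, the mean entropies of about 3.907 reported in the tables sit roughly 0.005 nats below the ceiling, for vanilla and stabilized runs alike.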

Figure: Max Importance-Sampling Ratio by Method

Method	Mean Entropy	Entropy Std	Mean Max IS	Divergences
Vanilla LUFFY	3.9067	0.0002	2.938	0/5
Seq-Level IS + Adaptive Mix	3.9071	0.0002	1.005	0/5
Bridged Traces	3.9071	0.0002	1.005	0/5
Prefix-Guided Hybrid	3.9071	0.0003	1.005	0/5
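The table does not define "Seq-Level IS". One common reading is a length-normalized, sequence-level importance ratio, which replaces per-token ratios with their geometric mean. The sketch below illustrates that reading only; the function names and log-probabilities are hypothetical and are not taken from the experiments.

```python
import math


def token_level_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new(y_t) / pi_old(y_t)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_level_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: exp(mean(logp_new - logp_old)).

    This is the geometric mean of the per-token ratios, so it can never
    exceed the largest of them.
    """
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


# Hypothetical per-token log-probabilities for one 4-token reference trace.
logp_old = [-1.0, -2.0, -1.5, -3.0]
logp_new = [-1.1, -1.0, -1.6, -3.2]

per_token = token_level_ratios(logp_new, logp_old)
per_sequence = sequence_level_ratio(logp_new, logp_old)

print("max token-level ratio:", round(max(per_token), 3))
print("sequence-level ratio: ", round(per_sequence, 3))
```

A single token whose probability shifts sharply dominates the token-level maximum, while the sequence-level ratio averages that shift over the whole trace. This is one plausible mechanism for the drop in Mean Max IS between the vanilla row and the stabilized rows, though the summary itself does not confirm it.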

Pathology	Cause	Vanilla LUFFY	Stabilized
IS Ratio Variance	Distribution mismatch	Max ratio 2.85	Max ratio <1.01
Pure-Imitation Trap	Zero on-policy reward	0 reward all seeds	0 reward (controlled)
Entropy Collapse	No IS clipping	Entropy ~3.907	Entropy ~3.907
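The last row attributes entropy collapse to missing IS clipping but does not say which clipping scheme the stabilized variants use. As an illustration only, the sketch below shows a PPO-style clipped surrogate; the threshold `eps=0.2` is a hypothetical default, not a setting reported here.

```python
def clip_ratio(ratio, eps=0.2):
    """Clamp an importance ratio to the interval [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, ratio))


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style pessimistic objective for one sample.

    Takes the smaller of the unclipped and clipped terms, so a large ratio
    cannot inflate the update when the advantage is positive.
    """
    return min(ratio * advantage, clip_ratio(ratio, eps) * advantage)


# The vanilla max ratio from the table, paired with a hypothetical advantage.
print(clipped_surrogate(2.85, advantage=1.0))
print(clipped_surrogate(2.85, advantage=-1.0))
```

With a positive advantage the contribution is capped at `1 + eps` times the advantage. With a negative advantage the unclipped term is kept, which is the usual pessimistic behavior of this objective.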