Stabilizing LUFFY Training on Hard Problems
Addressing importance-ratio variance, pure-imitation traps, and entropy collapse when using human reference solutions with LUFFY.
Stabilized Max IS Ratio: <1.01
Theoretical Max Entropy: 3.912
Divergences (Stabilized): 0
[Figures: Max Importance-Sampling Ratio by Method; Final Entropy Across Seeds; Entropy Stability (Variance Across Seeds); Method Comparison: IS Ratio vs Entropy Preservation]
Seed Robustness Summary (5 Seeds)
| Method | Mean Entropy | Entropy Std | Mean Max IS Ratio | Divergences |
| --- | --- | --- | --- | --- |
| Vanilla LUFFY | 3.9067 | 0.0002 | 2.938 | 0/5 |
| Seq-Level IS + Adaptive Mix | 3.9071 | 0.0002 | 1.005 | 0/5 |
| Bridged Traces | 3.9071 | 0.0002 | 1.005 | 0/5 |
| Prefix-Guided Hybrid | 3.9071 | 0.0003 | 1.005 | 0/5 |
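The mean entropies in this table sit just below the theoretical maximum of 3.912. That figure matches the entropy, in nats, of a uniform distribution over 50 outcomes, since ln 50 ≈ 3.912. The size of 50 is an inference from the number itself, not something stated here, so the sketch below is only a sanity check under that assumption.

```python
import math

def max_entropy_nats(num_outcomes: int) -> float:
    """Entropy (nats) of a uniform distribution over num_outcomes outcomes."""
    return math.log(num_outcomes)

def entropy_nats(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Assumption: 50 outcomes, inferred from the reported 3.912 maximum.
ASSUMED_NUM_OUTCOMES = 50
uniform = [1.0 / ASSUMED_NUM_OUTCOMES] * ASSUMED_NUM_OUTCOMES

print(round(max_entropy_nats(ASSUMED_NUM_OUTCOMES), 3))  # 3.912
print(round(entropy_nats(uniform), 3))                   # 3.912
```

Under this reading, a mean entropy of about 3.907 means the policy is still close to uniform, which is consistent with no entropy collapse in any of the rows.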
Three Pathologies Identified
| Pathology | Cause | Vanilla LUFFY | Stabilized |
| --- | --- | --- | --- |
| IS Ratio Variance | Distribution mismatch | Max ratio 2.85 | Max ratio <1.01 |
| Pure-Imitation Trap | Zero on-policy reward | Zero reward on all seeds | Zero reward (controlled) |
| Entropy Collapse | No IS clipping | Entropy ~3.907 | Entropy ~3.907 |
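To illustrate why a sequence-level importance ratio can stay near 1 while individual token ratios spike, here is a minimal sketch. How the "Seq-Level IS" method actually computes its ratio is not specified here; the length-normalized (geometric-mean) form, the clipping range `eps`, and all log-probability values below are assumptions chosen for illustration.

```python
import math

def token_is_ratios(policy_logps, behavior_logps):
    """Per-token importance ratios pi(token) / mu(token)."""
    return [math.exp(p - b) for p, b in zip(policy_logps, behavior_logps)]

def sequence_is_ratio(policy_logps, behavior_logps):
    """Length-normalized sequence ratio: exp of the mean per-token log-ratio.

    Assumption: one common way to define a sequence-level ratio,
    not necessarily the one used for the results above.
    """
    diffs = [p - b for p, b in zip(policy_logps, behavior_logps)]
    return math.exp(sum(diffs) / len(diffs))

def clip_ratio(ratio, eps=0.2):
    """Clamp a ratio to [1 - eps, 1 + eps]; eps is a hypothetical setting."""
    return max(1.0 - eps, min(1.0 + eps, ratio))

# Hypothetical per-token log-probabilities for one reference solution.
policy_logps = [-1.0, -2.0, -0.5, -3.0]
behavior_logps = [-2.0, -1.5, -1.0, -2.5]

token_ratios = token_is_ratios(policy_logps, behavior_logps)
seq_ratio = sequence_is_ratio(policy_logps, behavior_logps)

print("max token ratio:", max(token_ratios))
print("sequence ratio: ", seq_ratio)
print("clipped max token ratio:", clip_ratio(max(token_ratios)))
```

Averaging the log-ratios before exponentiating lets large and small token ratios offset each other, so the sequence-level value is bounded by the smallest and largest token ratio and is typically much closer to 1.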