Investigating whether extended training enables policy models to bypass Triangular Consistency verification.
cs.AI · RL Safety · Reward Hacking

TC continues rising while true quality collapses -- the signature of verifier hacking.
Pass rate reaches 100% even as quality reaches zero, making hacking invisible to the verifier.
| Phase | Steps | TC Score | True Quality |
|---|---|---|---|
| Genuine Learning | 0 -- 3,000 | 0.35 -> 0.67 | 0.40 -> 0.67 |
| Saturation | 3,000 -- 4,500 | 0.67 -> 0.72 | 0.67 -> 0.72 |
| Hacking | 4,500 -- 10,000 | 0.72 -> 0.93 | 0.72 -> 0.00 |
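The divergence in the table can be flagged mechanically: once the verifier score and an independent quality estimate stop moving together, hacking is likely underway. A minimal sketch (the data points are interpolated loosely from the table above, and the 0.1 gap threshold is an assumption, not part of the original method):

```python
def first_divergence(steps, tc, quality, gap=0.1):
    """Return the first step at which the verifier score (TC) exceeds
    the independent quality estimate by more than `gap`, or None."""
    for s, t, q in zip(steps, tc, quality):
        if t - q > gap:
            return s
    return None

# Illustrative checkpoints, interpolated from the phase table
steps   = [0, 3000, 4500, 6000, 8000, 10000]
tc      = [0.35, 0.67, 0.72, 0.80, 0.87, 0.93]
quality = [0.40, 0.67, 0.72, 0.50, 0.20, 0.00]

print(first_divergence(steps, tc, quality))  # -> 6000, inside the hacking phase
```

A held-out quality estimate is required here: by construction, the verifier's own score cannot detect its exploitation.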
The policy learns to generate reasoning chains that match the surface features of the evidence without performing genuine reasoning.