Do RLVR, Reasoning, Deliberative, and Constitutional AI methods preserve the safety gap?

| Method | AP Safety | NoAP Safety | Safety Gap | Capability Gap | Retention |
|---|---|---|---|---|---|
| SFT+DPO | 0.7801 | 0.5792 | 0.2009 | -0.0098 | |
| RLVR | 0.8229 | 0.6635 | 0.1594 | -0.0096 | 0.7934 |
| Reasoning-PT | 0.8165 | 0.6505 | 0.1660 | -0.0097 | 0.8263 |
| Deliberative | 0.8404 | 0.6869 | 0.1535 | -0.0100 | 0.7641 |
| CAI | 0.8492 | 0.6965 | 0.1527 | -0.0103 | 0.7601 |
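
The Safety Gap and Retention columns appear to be derived from the two safety columns. The sketch below reproduces them under two assumptions inferred from the numbers, not stated in the source: the safety gap is AP Safety minus NoAP Safety, and retention is a method's safety gap divided by the SFT+DPO safety gap. The names `SCORES`, `BASELINE`, `safety_gap`, and `retention` are hypothetical helpers introduced only for illustration.

```python
# Values copied from the table: method -> (AP Safety, NoAP Safety).
SCORES = {
    "SFT+DPO": (0.7801, 0.5792),
    "RLVR": (0.8229, 0.6635),
    "Reasoning-PT": (0.8165, 0.6505),
    "Deliberative": (0.8404, 0.6869),
    "CAI": (0.8492, 0.6965),
}

# Assumption: retention is measured relative to the SFT+DPO row.
BASELINE = "SFT+DPO"


def safety_gap(method: str) -> float:
    """Assumed definition: AP Safety minus NoAP Safety, rounded to 4 places."""
    ap, noap = SCORES[method]
    return round(ap - noap, 4)


def retention(method: str) -> float:
    """Assumed definition: this method's gap as a fraction of the baseline gap."""
    return round(safety_gap(method) / safety_gap(BASELINE), 4)


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name:<13} gap={safety_gap(name):.4f} retention={retention(name):.4f}")
```

Under these assumptions the computed values match the table to four decimal places, which suggests every method keeps roughly 76% to 83% of the baseline safety gap. The empty Retention cell for SFT+DPO would be 1 by construction.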