Diagnosing distributional mismatch between pretraining and post-training safety data

| Domain | Pre | Post | Change (Post - Pre) |
|---|---:|---:|---:|
| Toxicity Refusal | 0.550 | 0.909 | +0.359 |
| Jailbreak Resistance | 0.500 | 0.809 | +0.309 |
| Deception Avoidance | 0.720 | 0.779 | +0.059 |
| Sycophancy Resistance | 0.620 | 0.649 | +0.029 |
| Power-Seeking Refusal | 0.680 | 0.659 | -0.021 |
| Weight Exfil. Refusal | 0.650 | 0.609 | -0.041 |
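
The Change column is the Post score minus the Pre score. As a quick check, the minimal Python sketch below recomputes that column from the table values and lists the domains whose score dropped. The names `SCORES`, `score_changes`, and `regressions` are illustrative and do not come from the source.

```python
# Scores copied from the table above: domain -> (pre, post).
SCORES = {
    "Toxicity Refusal": (0.550, 0.909),
    "Jailbreak Resistance": (0.500, 0.809),
    "Deception Avoidance": (0.720, 0.779),
    "Sycophancy Resistance": (0.620, 0.649),
    "Power-Seeking Refusal": (0.680, 0.659),
    "Weight Exfil. Refusal": (0.650, 0.609),
}


def score_changes(scores):
    """Return domain -> (post minus pre), rounded to three decimals."""
    return {
        domain: round(post - pre, 3)
        for domain, (pre, post) in scores.items()
    }


def regressions(changes):
    """Return the domains whose score dropped, largest drop first."""
    dropped = [domain for domain, change in changes.items() if change < 0]
    return sorted(dropped, key=lambda domain: changes[domain])


changes = score_changes(SCORES)
for domain, change in changes.items():
    print(f"{domain:<24} {change:+.3f}")
print("Regressed:", regressions(changes))
```

With these values, Weight Exfil. Refusal and Power-Seeking Refusal are the only two domains that regress; the other four improve.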