Post-Training Misalignment Regression and Generalization Gaps

Diagnosing distributional mismatch between pretraining and post-training safety data

cs.CL | Tice et al. 2026 | arXiv: 2601.10160
Weight Exfil. Regression: -0.041
Toxicity->Exfil Transfer: 0.08
Toxicity Improvement: +0.359
Mitigation Recovery: +0.150

Post-Training Changes (Alignment Upsampled Model)

Domain                   Pre     Post    Change
Toxicity Refusal         0.550   0.909   +0.359
Jailbreak Resistance     0.500   0.809   +0.309
Deception Avoidance      0.720   0.779   +0.059
Sycophancy Resistance    0.620   0.649   +0.029
Power-Seeking Refusal    0.680   0.659   -0.021
Weight Exfil. Refusal    0.650   0.609   -0.041
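
The Change column is the post-training score minus the pre-training score for each domain. A minimal Python sketch of that arithmetic follows, using only the values from the table above. The dictionary layout and the function names `changes` and `regressions` are illustrative assumptions, not part of the source.

```python
# Pre/post scores copied from the table above.
# The data layout and helper names are hypothetical, chosen for illustration.
scores = {
    "Toxicity Refusal": (0.550, 0.909),
    "Jailbreak Resistance": (0.500, 0.809),
    "Deception Avoidance": (0.720, 0.779),
    "Sycophancy Resistance": (0.620, 0.649),
    "Power-Seeking Refusal": (0.680, 0.659),
    "Weight Exfil. Refusal": (0.650, 0.609),
}


def changes(table):
    """Return post minus pre for each domain, rounded to three decimals."""
    return {domain: round(post - pre, 3) for domain, (pre, post) in table.items()}


def regressions(table):
    """Return the domains whose score dropped after post-training."""
    return [domain for domain, delta in changes(table).items() if delta < 0]


for domain, delta in changes(scores).items():
    print(f"{domain:<24}{delta:+.3f}")

print("Regressed:", ", ".join(regressions(scores)))
```

Under these assumptions, the two domains flagged as regressions are the ones with negative entries in the Change column.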

[Figure: Pre vs Post-Training Alignment]

[Figure: Cross-Domain Transfer from Toxicity Training]

[Figure: Standard vs Domain-Aligned Post-Training]