Robustness of Alignment Pretraining Under Advanced Post-Training

Do RLVR, Reasoning, Deliberative, and Constitutional AI methods preserve the safety gap?

Model Configurations: 30
Gap Retained (7B): 76-83%
Alignment Tax: <1.1%
Post-Training Methods: 5

[Figure: Safety Scores: AP vs NoAP (7B Scale)]

[Figure: Retention Ratios (7B Scale)]

[Figure: Safety Gap Across Scales]

[Figure: Alignment Tax Across Methods & Scales]

[Figure: Robustness Deltas (7B): Gap Reduction vs. SFT+DPO Baseline]

Method Summary Table (7B Scale)

Method         AP Safety   NoAP Safety   Safety Gap   Cap. Gap   Retention
SFT+DPO        0.7801      0.5792        0.2009       -0.0098    n/a
RLVR           0.8229      0.6635        0.1594       -0.0096    0.7934
Reasoning-PT   0.8165      0.6505        0.1660       -0.0097    0.8263
Deliberative   0.8404      0.6869        0.1535       -0.0100    0.7641
CAI            0.8492      0.6965        0.1527       -0.0103    0.7601
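
The Safety Gap and Retention columns can be reproduced from the two safety scores. The sketch below assumes that the safety gap is AP Safety minus NoAP Safety, and that retention is each method's gap divided by the SFT+DPO gap. Both definitions are inferred from the numbers in the table rather than stated in the source, and the names `SCORES_7B`, `safety_gap`, and `retention_table` are illustrative only.

```python
# Scores copied from the 7B summary table: (AP safety, NoAP safety).
SCORES_7B = {
    "SFT+DPO": (0.7801, 0.5792),
    "RLVR": (0.8229, 0.6635),
    "Reasoning-PT": (0.8165, 0.6505),
    "Deliberative": (0.8404, 0.6869),
    "CAI": (0.8492, 0.6965),
}

# Assumption: SFT+DPO is the reference method for retention.
BASELINE = "SFT+DPO"


def safety_gap(ap: float, noap: float) -> float:
    """Assumed definition: safety score with AP minus score without it."""
    return ap - noap


def retention_table(scores: dict, baseline: str = BASELINE) -> dict:
    """Return each method's gap and its gap as a fraction of the baseline gap."""
    base_gap = safety_gap(*scores[baseline])
    rows = {}
    for method, (ap, noap) in scores.items():
        gap = safety_gap(ap, noap)
        rows[method] = {
            "gap": round(gap, 4),
            "retention": round(gap / base_gap, 4),
        }
    return rows


if __name__ == "__main__":
    for method, row in retention_table(SCORES_7B).items():
        print(f"{method:<13} gap={row['gap']:.4f} retention={row['retention']:.4f}")
```

Under these assumptions the computed values match the tabulated gaps and retention ratios to four decimal places, and the non-baseline retentions fall in the 76-83% range quoted in the headline figures. The baseline's own retention comes out as 1.0 by construction, which is presumably why the table leaves that cell empty.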