RL vs SFT for Alignment

SFT+RL ID Accuracy: 0.891
SFT+RL OOD Accuracy: 0.660
SFT Format Compliance: 0.949
SFT Reward Hacking: 0.071
RL Diversity: 0.785

SFT achieves highest format compliance (0.949) and lowest reward hacking (0.071).
Of the two standalone methods, RL achieves the better OOD generalization (0.589) and the higher behavioral diversity (0.785).
SFT+RL pipeline achieves the best overall alignment: OOD=0.660, ID=0.891.
RL and SFT are complementary: SFT teaches format, RL teaches generalization.
Reward hacking is the key risk of RL-based alignment: RL alone scores RHI=0.304, versus 0.071 for SFT.
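
The findings above can be cross-checked with a little arithmetic. The sketch below is illustrative only: it copies the reported figures into a dictionary, and the key names and helper functions are hypothetical, not part of the original evaluation. It also assumes the SFT reward hacking score (0.071) and the RL RHI (0.304) are measured on the same scale.

```python
# Reported figures copied from the summary; key names are illustrative only.
reported = {
    "sft_rl_id_acc": 0.891,       # SFT+RL in-distribution accuracy
    "sft_rl_ood_acc": 0.660,      # SFT+RL out-of-distribution accuracy
    "rl_ood_acc": 0.589,          # RL-only out-of-distribution accuracy
    "sft_reward_hacking": 0.071,  # SFT reward hacking score
    "rl_reward_hacking": 0.304,   # RL-only RHI (assumed same scale as SFT score)
}


def generalization_gap(id_acc: float, ood_acc: float) -> float:
    """Accuracy lost when moving from in-distribution to OOD inputs."""
    return id_acc - ood_acc


def hacking_ratio(rl_score: float, sft_score: float) -> float:
    """How many times larger the RL reward hacking score is than the SFT one."""
    return rl_score / sft_score


# ID-to-OOD drop for the combined pipeline.
gap = generalization_gap(reported["sft_rl_id_acc"], reported["sft_rl_ood_acc"])

# OOD accuracy gained by adding an SFT stage before RL.
ood_gain = reported["sft_rl_ood_acc"] - reported["rl_ood_acc"]

# Relative reward hacking of RL alone compared with SFT.
ratio = hacking_ratio(reported["rl_reward_hacking"], reported["sft_reward_hacking"])

print(f"SFT+RL ID-to-OOD gap:         {gap:.3f}")
print(f"SFT+RL OOD gain over RL:      {ood_gain:.3f}")
print(f"RL/SFT reward hacking ratio:  {ratio:.2f}")
```

Read this way, the combined pipeline still loses roughly a quarter of its in-distribution accuracy out of distribution, and RL alone shows several times the reward hacking reported for SFT, provided the two scores are comparable.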
