RL versus SFT for Alignment
Comparing how reinforcement learning and supervised fine-tuning shape LLM behavior
SFT+RL ID Accuracy: 0.891
SFT+RL OOD Accuracy: 0.660
SFT Format Compliance: 0.949
SFT Reward Hacking: 0.071
RL Diversity: 0.785
[Charts: In-Distribution vs OOD Accuracy; Format Compliance vs Reward Hacking; Final Metrics Comparison]
Key Findings
SFT achieves the highest format compliance (0.949) and the lowest reward hacking (0.071).
Standalone RL achieves the best OOD generalization of any single-stage method (0.589) and the highest behavioral diversity (0.785).
The SFT+RL pipeline achieves the best overall alignment (ID=0.891, OOD=0.660).
SFT and RL are complementary: SFT instills format compliance, while RL drives generalization.
Reward hacking is the key risk of RL-based alignment (reward hacking index, RHI = 0.304 for RL alone); a sketch of how such an index might be computed follows below.
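The report does not define how the reward hacking index is measured. A common proxy is the fraction of rollouts that score highly under the learned reward model yet fail a ground-truth check. The Python sketch below illustrates that proxy only; reward_fn, verifier_fn, and the 0.8 threshold are hypothetical stand-ins, not part of the original evaluation.

from typing import Callable, Sequence

def reward_hacking_index(
    responses: Sequence[str],
    reward_fn: Callable[[str], float],    # learned reward model score (hypothetical)
    verifier_fn: Callable[[str], bool],   # ground-truth correctness check (hypothetical)
    reward_threshold: float = 0.8,        # assumed cutoff for "high reward"
) -> float:
    """Fraction of responses the reward model rates highly but the verifier rejects."""
    if not responses:
        return 0.0
    hacked = sum(
        1 for r in responses
        if reward_fn(r) >= reward_threshold and not verifier_fn(r)
    )
    return hacked / len(responses)

Under this proxy, an RHI of 0.304 would mean roughly 30% of RL-only rollouts earn high reward without actually satisfying the task, whereas the SFT model's 0.071 indicates far less divergence between reward and ground truth.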