RL versus SFT for Alignment

Comparing how reinforcement learning and supervised fine-tuning shape LLM behavior

SFT+RL ID Accuracy: 0.891
SFT+RL OOD Accuracy: 0.660
SFT Format Compliance: 0.949
SFT Reward Hacking: 0.071
RL Diversity: 0.785

[Figure: In-Distribution vs OOD Accuracy]

[Figure: Format Compliance vs Reward Hacking]

[Figure: Final Metrics Comparison]

Key Findings