This section evaluates how well Long Chain-of-Thought (Long CoT) approaches scale from offline distillation (SFT) to interactive reinforcement learning (REINFORCE, PPO, GRPO).
| Paradigm | Final Performance (1.3B model) | Steps to Reach 0.7 | Bond Preservation | Perf. Drop Under Shift (0.3) |
|---|---|---|---|---|
| SFT | 0.683 | 5,008 | 0.897 | 0.193 |
| REINFORCE | 0.691 | 7,692 | 0.767 | 0.118 |
| PPO | 0.846 | 4,320 | 0.827 | 0.088 |
| GRPO | 0.913 | 3,408 | 0.857 | 0.067 |
Online RL methods, particularly GRPO, substantially outperform offline SFT for Long CoT molecular-structure learning. GRPO reaches a final performance of 0.913 versus SFT's 0.683, a 33.7% relative gain, while retaining 95.5% of SFT's structural integrity (bond preservation of 0.857 versus 0.897), and it converges fastest (3,408 steps to reach 0.7 versus 5,008 for SFT). The advantage widens with model scale and under distributional shift, where GRPO's performance drop (0.067) is roughly a third of SFT's (0.193).
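As a quick sanity check on the derived figures, the short script below recomputes the relative gain and the structural-integrity retention directly from the table values; it is a standalone sketch and does not depend on any training code.

```python
# Recompute the derived comparison figures from the table above.
sft  = {"final_perf": 0.683, "steps_to_0.7": 5008, "bond_pres": 0.897, "shift_drop": 0.193}
grpo = {"final_perf": 0.913, "steps_to_0.7": 3408, "bond_pres": 0.857, "shift_drop": 0.067}

# Relative task-performance gain of GRPO over SFT: (0.913 - 0.683) / 0.683 ≈ 0.337
rel_gain = (grpo["final_perf"] - sft["final_perf"]) / sft["final_perf"]

# Fraction of SFT's structural integrity retained: 0.857 / 0.897 ≈ 0.955
integrity_retained = grpo["bond_pres"] / sft["bond_pres"]

print(f"GRPO relative gain over SFT:   {rel_gain:.1%}")            # -> 33.7%
print(f"Structural integrity retained: {integrity_retained:.1%}")  # -> 95.5%
```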
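For context on what distinguishes GRPO from PPO in this comparison, the sketch below shows the group-relative advantage computation commonly associated with GRPO: rewards for a group of completions sampled from the same prompt are normalized by the group mean and standard deviation, which removes the need for a learned value function. This is a minimal illustration rather than the training code used in these experiments; the function name and the example rewards are hypothetical.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize per-completion rewards within a sampled group (GRPO-style).

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation, so no critic is required.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled Long CoT completions to one prompt.
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```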