Scaling Long CoT Molecular-Structure Learning to Online RL Settings

Evaluating how well Long Chain-of-Thought approaches scale from offline distillation to interactive reinforcement learning

cs.CL · Chen et al., 2026 · arXiv:2601.06002
Overview
GRPO final performance: 0.913
SFT final performance: 0.683
Relative gain: +33.7%
GRPO bond preservation: 0.857

Key Findings

Paradigm     Final Perf. (1.3B)   Steps to 0.7   Bond Preservation   Drop at Shift 0.3
SFT          0.683                5,008          0.897               0.193
REINFORCE    0.691                7,692          0.767               0.118
PPO          0.846                4,320          0.827               0.088
GRPO         0.913                3,408          0.857               0.067
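
GRPO, the strongest paradigm in the table, replaces PPO's learned value critic with a group-relative baseline computed over multiple sampled completions per prompt. A minimal sketch of that advantage computation, assuming the standard GRPO formulation (the paper's exact reward shaping and hyperparameters are not specified here):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each sampled completion's reward
    against the mean/std of its own group (same prompt), so no critic is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. rewards for four sampled Long-CoT completions of one molecule prompt
print(grpo_advantages([0.91, 0.67, 0.83, 0.72]))
```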

Summary

Online RL methods, particularly GRPO, substantially outperform offline SFT for Long CoT molecular-structure learning. GRPO reaches 33.7% higher final task performance (0.913 vs. 0.683) while retaining 95.5% of SFT's bond-preservation score (0.857 vs. 0.897). The advantage widens with model scale and under distributional shift.
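
The headline percentages follow directly from the table above; a quick arithmetic check:

```python
grpo_final, sft_final = 0.913, 0.683
grpo_bond, sft_bond = 0.857, 0.897

relative_gain = (grpo_final - sft_final) / sft_final  # ~0.337 -> +33.7%
bond_ratio = grpo_bond / sft_bond                     # ~0.955 -> 95.5%
print(f"relative gain {relative_gain:+.1%}, bond-preservation ratio {bond_ratio:.1%}")
```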

Figures: Training Paradigm Comparison (1.3B); Steps to Performance Thresholds; Bond Preservation; Topology Score; Performance vs. Model Size; Performance Drop vs. Shift; Recovery Steps vs. Shift.
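
The Bond Preservation and Topology Score panels track how much of the reference molecular structure the model's outputs retain. The metric definitions are not reproduced here, but a bond-preservation score is commonly the fraction of reference bonds recovered in the prediction; a minimal RDKit sketch under that assumption (the function and matching scheme are illustrative, not the paper's implementation):

```python
from collections import Counter
from rdkit import Chem

def bond_preservation(ref_smiles, pred_smiles):
    """Fraction of reference bonds (by atom-pair symbols and bond order) that
    also appear in the predicted molecule. Illustrative metric, not the paper's."""
    ref, pred = Chem.MolFromSmiles(ref_smiles), Chem.MolFromSmiles(pred_smiles)
    if ref is None or pred is None:
        return 0.0  # unparsable prediction counts as a total structural loss

    def bond_keys(mol):
        return Counter(
            (tuple(sorted((b.GetBeginAtom().GetSymbol(), b.GetEndAtom().GetSymbol()))),
             b.GetBondTypeAsDouble())
            for b in mol.GetBonds()
        )

    ref_bonds, pred_bonds = bond_keys(ref), bond_keys(pred)
    return sum((ref_bonds & pred_bonds).values()) / max(sum(ref_bonds.values()), 1)

print(bond_preservation("CCO", "CCO"))  # identical molecules -> 1.0
```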