Training Process Reward Models for Long LLM Reasoning Traces

Comparing Monte-Carlo, TD(lambda), Contrastive, and Intervention-based PRM training methodologies across trace lengths, reward sparsity levels, and random seeds.

Method Comparison (T=16, Moderate Sparsity)

Method         MSE     Correlation   Ranking Accuracy
Monte-Carlo    0.257   0.996         0.942
Contrastive    1.139   0.910         0.825
Intervention   1.064   0.768         0.852
TD(lambda)     1.197   0.207         0.572

Monte-Carlo training achieves near-perfect credit assignment correlation (0.996) and the lowest MSE (0.257), while TD(lambda) reaches only 0.207 correlation and 0.572 ranking accuracy. Contrastive and intervention methods fall in between: contrastive has the higher correlation (0.910 vs. 0.768), intervention the higher ranking accuracy (0.852 vs. 0.825).
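
The report does not define its three metrics precisely. The following is a minimal sketch, assuming MSE and Pearson correlation are computed between predicted and reference per-step values, and that ranking accuracy is the fraction of step pairs ordered consistently with the reference. The function names and toy values are illustrative and are not taken from the experiments.

```python
import math
from itertools import combinations


def mse(pred, target):
    """Mean squared error between predicted and reference per-step values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson(pred, target):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        return 0.0
    return cov / math.sqrt(var_p * var_t)


def ranking_accuracy(pred, target):
    """Fraction of step pairs ordered the same way by pred and target.

    Pairs that are tied in the reference are skipped (an assumption).
    """
    correct = total = 0
    for i, j in combinations(range(len(pred)), 2):
        if target[i] == target[j]:
            continue
        total += 1
        if (pred[i] - pred[j]) * (target[i] - target[j]) > 0:
            correct += 1
    return correct / total if total else 0.0


# Toy values for illustration only, not experimental data.
true_credit = [0.0, 1.0, 2.0, 3.0]
predicted = [0.0, 2.0, 1.0, 3.0]
print(mse(predicted, true_credit))
print(pearson(predicted, true_credit))
print(ranking_accuracy(predicted, true_credit))
```

In this toy case one swapped pair of steps leaves correlation fairly high while costing one of six pairwise comparisons, which illustrates how correlation and ranking accuracy can disagree about which method is better.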

[Figure: Final Metrics Comparison]

[Figures: Credit Assignment Correlation vs. Trace Length; Ranking Accuracy vs. Trace Length]

Monte-Carlo maintains stable correlation (>0.99) across all trace lengths. As traces lengthen to T=64, contrastive correlation degrades from 0.93 to 0.56 and intervention correlation from 0.92 to 0.29, while TD(lambda) correlation approaches zero.
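
One way to see how trace length can affect the two value-target methods differently is to compare their regression targets. The following is a minimal sketch, assuming a single terminal outcome reward and the standard backward recursion for lambda-returns. The report does not state its exact target definitions, so the function names, `lam`, `gamma`, and the toy trace are all illustrative.

```python
def monte_carlo_targets(rewards, gamma=1.0):
    """Discounted return from each step to the end of the trace."""
    targets, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        targets.append(running)
    return targets[::-1]


def td_lambda_targets(rewards, values, lam=0.9, gamma=1.0):
    """Lambda-returns computed backward through the trace.

    rewards[t] is the reward received after step t; values[t] is the current
    value estimate at step t. The value after the last step is taken to be 0.
    """
    horizon = len(rewards)
    targets = [0.0] * horizon
    next_return = 0.0  # lambda-return beyond the final step
    next_value = 0.0   # value estimate beyond the final step
    for t in reversed(range(horizon)):
        blended = (1 - lam) * next_value + lam * next_return
        targets[t] = rewards[t] + gamma * blended
        next_return = targets[t]
        next_value = values[t]
    return targets


# Toy trace: outcome reward only at the last step, untrained value estimates.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]
print(monte_carlo_targets(rewards))
print(td_lambda_targets(rewards, values, lam=0.9))
```

With uninformative value estimates, the outcome signal that reaches step t in this sketch shrinks as lam^(T-1-t), whereas the Monte-Carlo target is unaffected by distance from the outcome. Setting `lam=1.0` recovers the Monte-Carlo targets. Whether this mechanism accounts for the degradation reported here is not established by the report.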

[Figures: Ranking Accuracy Across Reward Sparsity Levels; Correlation Across Sparsity Levels]

Monte-Carlo, contrastive, and intervention methods are robust to reward sparsity: their ranking accuracy varies by less than 0.02 across sparsity levels. TD(lambda) is the most sensitive, dropping from 0.60 to 0.44 under sparse rewards.

[Figures: Correlation Convergence; MSE Convergence]

Monte-Carlo converges rapidly to high correlation within ~50 evaluation steps. Contrastive shows steady improvement throughout training. TD(lambda) plateaus at low correlation early, indicating fundamental limitations rather than slow convergence.

Multi-Seed Validation (5 Seeds)

Statistical Summary

Method         Correlation (mean +/- std)   MSE (mean +/- std)   Ranking Accuracy (mean +/- std)
Monte-Carlo    0.994 +/- 0.003              0.229 +/- 0.055      0.944 +/- 0.004
Contrastive    0.912 +/- 0.010              1.139 +/- 0.050      0.825 +/- 0.007
Intervention   0.767 +/- 0.049              1.064 +/- 0.080      0.836 +/- 0.013
TD(lambda)     0.198 +/- 0.026              1.008 +/- 0.258      0.526 +/- 0.033

Monte-Carlo exhibits the lowest seed-to-seed variability in correlation (std=0.003) and ranking accuracy (std=0.004), supporting its reliability. Intervention shows the highest variability in correlation (std=0.049), suggesting environment sensitivity. TD(lambda) underperforms on correlation in every run (0.198 +/- 0.026), which points to systematic failure rather than random variation, although its MSE is the least stable of the four methods (std=0.258).
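
The "mean +/- std" entries can be produced by a simple aggregation over seeds. The following is a minimal sketch, assuming one score per seed and the sample standard deviation (n-1 denominator); the report does not say whether sample or population std was used. The method name and the per-seed scores are placeholders, not the values behind the table.

```python
from statistics import mean, stdev


def summarize(per_seed_scores):
    """Map each method to (mean, sample std) over its per-seed scores."""
    return {name: (mean(s), stdev(s)) for name, s in per_seed_scores.items()}


def format_entry(mean_value, std_value, digits=3):
    """Render a 'mean +/- std' cell in the style of the summary table."""
    return f"{mean_value:.{digits}f} +/- {std_value:.{digits}f}"


# Placeholder scores for five seeds; NOT the data behind the table above.
scores = {"method_a": [0.90, 0.92, 0.91, 0.93, 0.89]}
for name, (m, s) in summarize(scores).items():
    print(name, format_entry(m, s))
```

With only five seeds the choice between `stdev` and `pstdev` changes the reported spread by about 12 percent, so it is worth stating which one a table uses.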