Method Comparison (T=16, Moderate Sparsity)
| Method | MSE | Correlation | Ranking Accuracy |
|---|---|---|---|
| Monte-Carlo | 0.257 | 0.996 | 0.942 |
| Contrastive | 1.139 | 0.910 | 0.825 |
| Intervention | 1.064 | 0.768 | 0.852 |
| TD(lambda) | 1.197 | 0.207 | 0.572 |
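Monte-Carlo training achieves near-perfect credit assignment correlation (0.996) and the lowest MSE (0.257). TD(lambda) performs worst, with a correlation of only 0.207 and a ranking accuracy of 0.572. Contrastive and intervention methods sit in between on correlation (0.910 and 0.768) and ranking accuracy (0.825 and 0.852), although their MSE (1.139 and 1.064) is close to that of TD(lambda) (1.197).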
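The report does not define its three metrics, so the following is only a minimal sketch under assumed standard definitions: mean squared error, Pearson correlation, and pairwise ranking accuracy (the fraction of step pairs whose predicted credit is ordered the same way as the true credit). The function names and the toy values are hypothetical and are not taken from the experiments.

```python
from itertools import combinations
from math import sqrt


def mse(pred, true):
    """Mean squared error between predicted and true per-step credit."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def pearson(pred, true):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(true)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    if var_p == 0 or var_t == 0:
        return 0.0
    return cov / sqrt(var_p * var_t)


def ranking_accuracy(pred, true):
    """Fraction of pairs (i, j) with distinct true credit that pred orders correctly."""
    pairs = [(i, j) for i, j in combinations(range(len(true)), 2)
             if true[i] != true[j]]
    if not pairs:
        return 0.0
    correct = sum((pred[i] - pred[j]) * (true[i] - true[j]) > 0
                  for i, j in pairs)
    return correct / len(pairs)


# Toy values for illustration only (not experimental data).
true_credit = [0.0, 1.0, 2.0, 3.0]
close_pred = [0.1, 0.9, 2.2, 2.8]      # close in value and in order
scaled_pred = [0.0, 10.0, 20.0, 30.0]  # right order, wrong scale

print(mse(close_pred, true_credit), pearson(close_pred, true_credit))
print(mse(scaled_pred, true_credit), pearson(scaled_pred, true_credit))
print(ranking_accuracy(scaled_pred, true_credit))
```

Under these definitions MSE is sensitive to the scale of the predictions while correlation and ranking accuracy are not, which is one possible reason a method can show high correlation alongside a large MSE.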
[Figure: Credit Assignment Correlation vs. Trace Length]
[Figure: Ranking Accuracy vs. Trace Length]
Monte-Carlo maintains a stable correlation above 0.99 at every trace length. By T=64, contrastive correlation degrades from 0.93 to 0.56 and intervention correlation from 0.92 to 0.29, while TD(lambda) approaches zero correlation.
Ranking Accuracy Across Reward Sparsity Levels
Correlation Across Sparsity Levels
Monte-Carlo, contrastive, and intervention methods are robust to reward sparsity: their ranking accuracy varies by less than 0.02 across sparsity levels. TD(lambda) is the most sensitive, with ranking accuracy dropping from 0.60 to 0.44 under sparse rewards.
Monte-Carlo converges rapidly to high correlation within ~50 evaluation steps. Contrastive shows steady improvement throughout training. TD(lambda) plateaus at low correlation early, indicating fundamental limitations rather than slow convergence.
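One way to separate "plateaued at a low value" from "still converging slowly" is to compare the average of the most recent evaluation window with the window before it. The sketch below is an assumption about how such a check could look, not the procedure used in this report; the function names, the window size, and both thresholds are hypothetical, and the curves are synthetic.

```python
def recent_gain(curve, window=10):
    """Mean of the last `window` points minus the mean of the window before it."""
    if len(curve) < 2 * window:
        raise ValueError("curve must contain at least 2 * window points")
    last = curve[-window:]
    prev = curve[-2 * window:-window]
    return sum(last) / window - sum(prev) / window


def diagnose(curve, window=10, tol=0.01, good=0.9):
    """Label a correlation curve; `tol` and `good` are arbitrary illustrative thresholds."""
    if recent_gain(curve, window) > tol:
        return "still improving"
    final = sum(curve[-window:]) / window
    return "converged" if final >= good else "plateaued low"


# Synthetic curves for illustration only (not the experimental curves).
fast = [min(0.99, 0.02 * step) for step in range(100)]   # saturates high, early
steady = [0.009 * step for step in range(100)]           # keeps climbing
flat = [min(0.2, 0.02 * step) for step in range(100)]    # saturates low, early

for name, curve in [("fast", fast), ("steady", steady), ("flat", flat)]:
    print(name, diagnose(curve))
```

A curve labeled "plateaued low" has stopped improving well short of a useful correlation, which is the pattern described for TD(lambda) above; more training would not be expected to help in that case.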
Multi-Seed Validation (5 Seeds)
Statistical Summary
| Method | Correlation (mean +/- std) | MSE (mean +/- std) | Ranking Accuracy (mean +/- std) |
|---|---|---|---|
| Monte-Carlo | 0.994 +/- 0.003 | 0.229 +/- 0.055 | 0.944 +/- 0.004 |
| Contrastive | 0.912 +/- 0.010 | 1.139 +/- 0.050 | 0.825 +/- 0.007 |
| Intervention | 0.767 +/- 0.049 | 1.064 +/- 0.080 | 0.836 +/- 0.013 |
| TD(lambda) | 0.198 +/- 0.026 | 1.008 +/- 0.258 | 0.526 +/- 0.033 |
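Monte-Carlo has the lowest seed-to-seed spread in correlation (std=0.003) and in ranking accuracy (std=0.004), confirming its reliability. Intervention has the highest spread in correlation (std=0.049), suggesting environment sensitivity, while TD(lambda) has the highest spread in MSE (std=0.258) and ranking accuracy (std=0.033). TD(lambda) underperforms consistently: its correlation std (0.026) is small relative to its gap from the next-lowest method (0.198 vs. 0.767), which points to systematic failure rather than random variation.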
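The summary statistics above can be reproduced from per-seed results with a few lines of Python. This is a sketch under assumptions: the report does not say whether it uses the sample or the population standard deviation (the sample version is used here), the helper names are hypothetical, and the per-seed list is a placeholder rather than real data. Only the means and standard deviations passed to `gap_in_stds` come from the table.

```python
from statistics import mean, stdev


def summarize(per_seed):
    """Map each metric name to (mean, sample std) over its per-seed values."""
    return {metric: (mean(values), stdev(values))
            for metric, values in per_seed.items()}


def gap_in_stds(mean_a, std_a, mean_b, std_b):
    """Difference between two means, in units of the larger of the two stds."""
    return (mean_a - mean_b) / max(std_a, std_b)


# Placeholder per-seed values, for illustration only.
example = summarize({"correlation": [1.0, 2.0, 3.0]})
print(example)

# Reported correlation statistics: Intervention vs. TD(lambda).
gap = gap_in_stds(0.767, 0.049, 0.198, 0.026)
print(f"gap = {gap:.1f} standard deviations")
```

With the reported numbers the gap is more than ten standard deviations, so the TD(lambda) shortfall is far larger than what seed-to-seed noise alone would produce.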