Evaluation of Robo-Dopamine on RoboRewardBench

Simulation-Based Benchmarking of Vision-Language Reward Models | Lee et al., arXiv:2601.00675

Headline results:
- Best accuracy (fine-tuned VLM): 97.8%
- Process RM accuracy: 96.3%
- Process RM temporal consistency: 0.995
- Process vs. outcome RM gain (assembly task): +1.2%
- All pairwise tests: p < 0.001

[Charts: Overall Benchmark Accuracy by Model; Task-Specific Accuracy; Process vs Outcome by Task; Temporal Consistency; Backbone Scaling]

Pairwise Statistical Comparisons

Comparison                       p-value    Effect Size (d)   Significant
General VLM vs Fine-tuned VLM    < 1e-100   1.04              Yes
General VLM vs Outcome RM        < 1e-29    0.52              Yes
General VLM vs Process RM        < 1e-54    0.72              Yes
Fine-tuned VLM vs Outcome RM     < 1e-34    0.56              Yes
Fine-tuned VLM vs Process RM     < 1e-20    0.42              Yes
Outcome RM vs Process RM         < 1e-4     0.19              Yes
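Each row of the table pairs a significance test with Cohen's d as the effect size. For readers reproducing this kind of comparison, here is a minimal sketch; the synthetic per-episode scores and the choice of a two-sample t-test are assumptions for illustration, not the paper's exact procedure or data:

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(
        ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1))
        / (na + nb - 2)
    )
    return (np.mean(a) - np.mean(b)) / pooled

rng = np.random.default_rng(0)
# Hypothetical per-episode accuracy scores for two reward models
# (illustrative only; not the benchmark's actual data).
outcome_rm = rng.normal(0.930, 0.05, 500)
process_rm = rng.normal(0.963, 0.05, 500)

t, p = stats.ttest_ind(process_rm, outcome_rm)
print(f"d = {cohens_d(process_rm, outcome_rm):.2f}, p = {p:.1e}")
```

With many episodes per model, even a small mean gap (such as the 0.19 effect size between the outcome and process RMs) can yield an extremely small p-value, which is why effect size is reported alongside significance.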