Simulation-Based Benchmarking of Vision-Language Reward Models | Lee et al., arXiv:2601.00675

| Comparison | p-value | Effect Size (d) | Significant |
|---|---|---|---|
| General VLM vs Fine-tuned VLM | <1e-100 | 1.04 | Yes |
| General VLM vs Outcome RM | <1e-29 | 0.52 | Yes |
| General VLM vs Process RM | <1e-54 | 0.72 | Yes |
| Fine-tuned VLM vs Outcome RM | <1e-34 | 0.56 | Yes |
| Fine-tuned VLM vs Process RM | <1e-20 | 0.42 | Yes |
| Outcome RM vs Process RM | <1e-4 | 0.19 | Yes |
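
To help read the effect-size column, the following is a minimal sketch that assumes d denotes Cohen's d computed with a pooled standard deviation; the table itself does not state the definition or the statistical test used. The function names `cohens_d` and `magnitude` are hypothetical, and the 0.2 / 0.5 / 0.8 cutoffs are Cohen's conventional benchmarks rather than anything taken from the paper.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed definition of d)."""
    n_a, n_b = len(a), len(b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def magnitude(d):
    """Label |d| using Cohen's conventional cutoffs (0.2, 0.5, 0.8)."""
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"


# Effect sizes copied from the table above.
reported = {
    "General VLM vs Fine-tuned VLM": 1.04,
    "General VLM vs Outcome RM": 0.52,
    "General VLM vs Process RM": 0.72,
    "Fine-tuned VLM vs Outcome RM": 0.56,
    "Fine-tuned VLM vs Process RM": 0.42,
    "Outcome RM vs Process RM": 0.19,
}

for comparison, d in reported.items():
    print(f"{comparison}: d = {d:.2f} ({magnitude(d)})")
```

Under these conventional cutoffs, only the General VLM vs Fine-tuned VLM comparison would count as large, while Outcome RM vs Process RM (d = 0.19) sits just below the "small" threshold even though it is statistically significant. A small p-value alone does not imply a large difference.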