Robustness of VLMs for Robotics Reward Modeling

Perturbation-Based Evaluation of Vision-Language Reward Models | Lee et al., arXiv:2601.00675

General VLM (Clean): 92.7%
General VLM (Worst): 80.9%
Tuned VLM (Clean): 96.5%
Ensemble (Worst Visual): 92.6%
Reliability Threshold: 0.70
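
The gap between the clean and worst-case figures for the general VLM can be worked out directly from the two headline numbers above. The following is a minimal sketch of that arithmetic; the function name `degradation` is illustrative and does not come from the paper.

```python
def degradation(clean_acc: float, worst_acc: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = clean_acc - worst_acc
    relative = 100.0 * absolute / clean_acc
    return absolute, relative

# Headline figures for the general VLM, taken from the summary above.
abs_drop, rel_drop = degradation(clean_acc=92.7, worst_acc=80.9)
print(f"absolute drop: {abs_drop:.1f} points, relative drop: {rel_drop:.1f}%")
```

For the general VLM this gives a drop of about 11.8 percentage points, or roughly 12.7% of its clean accuracy. The summary lists no matching clean and worst-case pair for the tuned or ensemble models, so the same calculation is not shown for them.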

[Charts omitted: Accuracy Degradation Under Perturbations; Worst-Case Accuracy (Severity 4); Rank Correlation Preservation]
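
One way to quantify rank correlation preservation is to check whether a reward model orders the same set of trajectories the same way before and after a perturbation. The sketch below assumes Spearman's rank correlation; the source does not state which coefficient the paper uses, and the scores are made-up placeholders, not results from the paper.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank across the tie group
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical reward scores for five trajectories (illustrative only).
clean_scores = [0.91, 0.15, 0.62, 0.40, 0.78]
perturbed_scores = [0.85, 0.22, 0.35, 0.48, 0.70]
rho = spearman(clean_scores, perturbed_scores)
```

A value of `rho` near 1 means the perturbed model still ranks trajectories in nearly the same order as the clean model, even if the absolute reward values have shifted. The function is undefined when either input is constant, since the rank variance is then zero.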

Reliability Thresholds (Max Severity Before Unreliable)

Model               Visual  Semantic  Temporal  Domain Shift
General VLM         1       1         1         1
Robotics-tuned VLM  4       3         5         4
Ensemble VLM        5       5         5         4
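
The thresholds in the table can be read as the highest severity level at which a model still meets the 0.70 reliability threshold. A minimal sketch of that rule follows. It assumes that severities are numbered from 1, that the score is some per-severity reliability metric (the source does not say which), and that a model counts as unreliable from the first severity at which the score falls below the threshold. The score curve is hypothetical.

```python
RELIABILITY_THRESHOLD = 0.70  # value quoted in the summary above

def max_reliable_severity(scores_by_severity, threshold=RELIABILITY_THRESHOLD):
    """Return the highest severity reached before the score first drops below threshold.

    scores_by_severity[0] is the score at severity 1, [1] at severity 2, and so on.
    Returns 0 if the model is already below the threshold at severity 1.
    """
    max_severity = 0
    for severity, score in enumerate(scores_by_severity, start=1):
        if score < threshold:
            break
        max_severity = severity
    return max_severity

# Hypothetical score curve (not from the paper): reliable through severity 3.
example_scores = [0.93, 0.84, 0.72, 0.66, 0.51]
print(max_reliable_severity(example_scores))
```

Stopping at the first failure, rather than taking the highest severity that happens to pass, keeps the result conservative if a score curve is not monotonic.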