Robustness of VLMs for Robotics Reward Modeling
Perturbation-Based Evaluation of Vision-Language Reward Models | Lee et al., arXiv:2601.00675
General VLM (Clean): 92.7%
General VLM (Worst): 80.9%
Tuned VLM (Clean): 96.5%
Ensemble (Worst Visual): 92.6%
Reliability Threshold: 0.70
[Figure: Accuracy Degradation Under Perturbations, by perturbation type (Visual, Semantic, Temporal, Domain Shift)]
[Figure: Worst-Case Accuracy (Severity 4)]
[Figure: Rank Correlation Preservation]
Reliability Thresholds (Max Severity Before Unreliable)
Model                Visual  Semantic  Temporal  Domain Shift
General VLM          1       1         1         1
Robotics-tuned VLM   4       3         5         4
Ensemble VLM         5       5         5         4
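
The reliability thresholds above report, for each model and perturbation type, the highest severity a model tolerates before it is considered unreliable. The page does not say how this is computed, so the following is only a minimal sketch under stated assumptions: that a per-severity score (for example accuracy or rank correlation, on a 0 to 1 scale) is compared against the 0.70 reliability threshold, and that severities are numbered from 1. The function name `max_reliable_severity` and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence

# Threshold value shown on the page; which metric it applies to is an assumption.
RELIABILITY_THRESHOLD = 0.70


def max_reliable_severity(
    scores: Sequence[float], threshold: float = RELIABILITY_THRESHOLD
) -> int:
    """Return the highest severity (1-indexed) reached before the score
    first drops below the threshold; 0 if even severity 1 fails."""
    reliable = 0
    for severity, score in enumerate(scores, start=1):
        if score < threshold:
            break  # later severities are ignored once the model has failed
        reliable = severity
    return reliable


# Hypothetical scores for severities 1..5, NOT results from the paper.
example_scores = [0.91, 0.84, 0.76, 0.68, 0.72]
print(max_reliable_severity(example_scores))  # prints 3
```

Stopping at the first failure means a later recovery (0.72 at severity 5 in the example) does not raise the reported threshold, which is one plausible reading of "max severity before unreliable".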