Can VLMs see what they did? Recovering text-feedback performance through Difference-Augmented Prompting and Chain-of-State Reasoning
All evaluated VLMs show significant performance drops when textual environment feedback is removed (Wang et al., VisGym 2026). This work shows that the feedback gap can be closed: given an explicit visual difference map as auxiliary input, Difference-Augmented Prompting (DAP) reaches a validity accuracy of 0.879, which is 103.4% of the 0.850 obtained with text feedback.
Naive Baseline: the pre-action and post-action frames are presented directly to the VLM. This recovers only 76.6% of text-feedback performance and struggles with subtle pixel changes.
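Difference-Augmented Prompting (DAP): an explicit pixel difference map is computed between the pre-action and post-action frames, then cleaned with noise suppression and morphological closing. This turns a comparison task into a description task. DAP achieves a validity accuracy of 1.000 on Maze 2D and Sliding Block.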
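The difference-map step described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the grayscale list-of-lists frame format, the `threshold` value, the square structuring element, and every function name (`diff_mask`, `closing`, `changed_bbox`) are assumptions made for the example. Thresholding stands in for noise suppression here; a real pipeline would more likely use an image library, for instance OpenCV's `cv2.morphologyEx` with `cv2.MORPH_CLOSE`.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # grayscale frame as rows of 0..255 intensities


def diff_mask(pre: Grid, post: Grid, threshold: int = 25) -> Grid:
    """Mark pixels whose absolute change exceeds `threshold`.

    Thresholding doubles as a simple form of noise suppression (assumption).
    """
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_pre, row_post)]
        for row_pre, row_post in zip(pre, post)
    ]


def _morph(mask: Grid, radius: int, dilate: bool) -> Grid:
    """Dilate (max) or erode (min) with a square window clipped at the borders."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                mask[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            out[y][x] = max(window) if dilate else min(window)
    return out


def closing(mask: Grid, radius: int = 1) -> Grid:
    """Morphological closing: dilation followed by erosion."""
    return _morph(_morph(mask, radius, dilate=True), radius, dilate=False)


def changed_bbox(mask: Grid) -> Optional[Tuple[int, int, int, int]]:
    """Bounding box (top, left, bottom, right) of changed pixels, or None."""
    coords = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    ys, xs = zip(*coords)
    return min(ys), min(xs), max(ys), max(xs)


# Toy 7x7 frames: two changed pixels with a one-pixel gap, plus faint noise.
pre = [[0] * 7 for _ in range(7)]
post = [row[:] for row in pre]
post[3][2] = post[3][4] = 200  # real change
post[0][0] = 10                # faint noise, below the threshold

raw = diff_mask(pre, post)
clean = closing(raw)
print(changed_bbox(clean))
```

In this toy case the faint pixel at the corner is dropped by the threshold, and closing fills the one-pixel gap between the two changed pixels so the change reads as a single connected region. A region like that, or its bounding box, is the kind of auxiliary input a VLM could be asked to describe rather than having to find it by comparing two frames.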
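Chain-of-State Reasoning (VCoS+DAP in the tables below) decomposes the reasoning into four steps: State Description, Change Detection, Action Matching, and Consequence Derivation. It matches DAP on validity accuracy (0.879) while offering greater interpretability.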
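One way to picture the four-step decomposition is as a staged prompt. The sketch below is hypothetical: the step names come from the text, but the instruction wording, the `VCOS_STEPS` constant, and the `build_vcos_prompt` helper are illustrative assumptions, not the prompts used in the work.

```python
from typing import List, Tuple

# Step names are from the text; the instruction wording is an assumption.
VCOS_STEPS: List[Tuple[str, str]] = [
    ("State Description", "Describe the scene before and after the action."),
    ("Change Detection", "List what changed between the two frames, using the difference map."),
    ("Action Matching", "State whether the observed change is consistent with the attempted action."),
    ("Consequence Derivation", "Conclude whether the action succeeded, was blocked, or had no effect."),
]


def build_vcos_prompt(action: str) -> str:
    """Assemble a numbered four-step prompt for one attempted action."""
    lines = [f"Attempted action: {action}", "Answer the following steps in order."]
    for i, (name, instruction) in enumerate(VCOS_STEPS, start=1):
        lines.append(f"Step {i} ({name}): {instruction}")
    return "\n".join(lines)


print(build_vcos_prompt("move right"))
```

Keeping the steps as data rather than as one fixed string makes each intermediate answer easy to log separately, which is where the interpretability benefit would come from.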

Aggregate results over all four environments:

| Method | Validity Acc. | Outcome Acc. | Change Detection F1 | Feedback Gap Ratio |
|---|---|---|---|---|
| Naive Baseline | 0.651 | 0.579 | 0.710 | 0.766 |
| DAP | 0.879 | 0.879 | 0.918 | 1.034 |
| VCoS+DAP | 0.879 | 0.828 | 0.918 | 1.034 |
| Text Feedback | 0.850 | 1.000 | | |
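The Feedback Gap Ratio column is consistent with dividing each method's validity accuracy by the text-feedback validity accuracy of 0.850. That reading is inferred from the numbers in the table rather than from a stated definition, and the function name below is an assumption. A quick check:

```python
TEXT_FEEDBACK_VALIDITY = 0.850  # validity accuracy with text feedback, from the table


def feedback_gap_ratio(validity_acc: float,
                       reference: float = TEXT_FEEDBACK_VALIDITY) -> float:
    """Fraction of text-feedback validity accuracy recovered without text feedback."""
    return validity_acc / reference


for method, acc in [("Naive Baseline", 0.651), ("DAP", 0.879), ("VCoS+DAP", 0.879)]:
    print(f"{method}: {feedback_gap_ratio(acc):.3f}")
```

Rounded to three decimals, this gives 0.766 for the naive baseline and 1.034 for DAP and VCoS+DAP, matching the table. A ratio above 1 means the method exceeds the text-feedback reference on validity accuracy.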
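
Per-environment validity accuracy for each method, with sample counts by outcome (Success, Blocked, No Effect):

| Environment | Total | Success | Blocked | No Effect | Naive | DAP | VCoS+DAP |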
|---|---|---|---|---|---|---|---|
| Maze 2D | 200 | 107 | 93 | 0 | 0.580 | 1.000 | 1.000 |
| Sliding Block | 200 | 142 | 58 | 0 | 0.925 | 1.000 | 1.000 |
| Matchstick | 200 | 120 | 0 | 80 | 0.470 | 0.600 | 0.600 |
| Maze 3D | 200 | 191 | 9 | 0 | 0.630 | 0.915 | 0.915 |
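As a consistency check, the sample-weighted mean of the per-environment accuracies should reproduce the aggregate validity accuracy. The sketch below copies the Naive and DAP columns from the table; the helper name `weighted_accuracy` is an assumption for illustration.

```python
# (environment, samples, naive accuracy, DAP accuracy), copied from the table
ROWS = [
    ("Maze 2D", 200, 0.580, 1.000),
    ("Sliding Block", 200, 0.925, 1.000),
    ("Matchstick", 200, 0.470, 0.600),
    ("Maze 3D", 200, 0.630, 0.915),
]


def weighted_accuracy(rows, column: int) -> float:
    """Sample-weighted mean of one accuracy column."""
    total = sum(r[1] for r in rows)
    return sum(r[1] * r[column] for r in rows) / total


naive = weighted_accuracy(ROWS, 2)
dap = weighted_accuracy(ROWS, 3)
print(f"naive={naive:.4f} dap={dap:.4f}")
```

The results land at roughly 0.651 and 0.879, in line with the aggregate validity accuracies. This indicates that the per-environment columns report validity accuracy, and that Matchstick accounts for most of the remaining error.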