
Visual Action-Consequence Inference (VACI)

Can VLMs see what they did? Recovering text-feedback performance through Difference-Augmented Prompting and Chain-of-State Reasoning

Open Problem

All evaluated VLMs show significant performance drops when textual environment feedback is removed (Wang et al., VisGym 2026). This work demonstrates that the feedback gap is closable: Difference-Augmented Prompting (DAP), which supplies an explicit visual difference map as auxiliary input, recovers 103.4% of the text-feedback baseline's validity accuracy.
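The headline number is a ratio of visual-only performance to the text-feedback baseline. A minimal sketch of this (assumed) definition, with the metric values taken from the results table below:

```python
def recovery_ratio(visual_only: float, text_feedback: float) -> float:
    """Feedback gap recovery ratio: score achieved without textual
    feedback, divided by the score achieved with it (assumed definition)."""
    return visual_only / text_feedback

# DAP validity accuracy (0.879) vs. the text-feedback baseline (0.850):
print(round(recovery_ratio(0.879, 0.850), 3))  # 1.034, i.e. 103.4%
```

A ratio above 1.0 means the visual-only method exceeds the text-feedback baseline rather than merely matching it.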

103.4%
Feedback Gap Recovery
DAP exceeds the text-feedback baseline
0.879
DAP Validity Accuracy
+0.228 vs Naive (0.651)
73.1%
Probe Accuracy
Bottleneck is in reasoning
800
VACI-Bench Transitions
4 environments × 200 transitions each

Naive Baseline

Presents the pre- and post-action frames directly to the VLM. Recovers only 76.6% of text-feedback performance and struggles with subtle pixel changes.

Difference-Augmented Prompting (DAP)

Computes an explicit pixel-difference map, then applies noise suppression and morphological closing. This transforms a frame-comparison task into a description task. Achieves 1.000 validity accuracy on Maze 2D and Sliding Block.
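The DAP preprocessing step can be sketched as follows. This is a minimal illustration assuming grayscale frames, a simple absolute-difference threshold for noise suppression, and a 3×3 closing written with plain NumPy shifts; the paper's exact thresholds and structuring element are not specified here.

```python
import numpy as np

def diff_map(pre: np.ndarray, post: np.ndarray, thresh: int = 30) -> np.ndarray:
    """Binary map of pixels that changed between frames; small intensity
    jitter below `thresh` is suppressed (threshold value is illustrative)."""
    d = np.abs(post.astype(np.int16) - pre.astype(np.int16))
    return (d > thresh).astype(np.uint8)

def close_3x3(mask: np.ndarray) -> np.ndarray:
    """Morphological closing (dilate, then erode) with a 3x3 square
    structuring element, to bridge small gaps in the change mask."""
    def dilate(m):
        p = np.pad(m, 1)  # zero padding: nothing is "on" outside the frame
        out = np.zeros_like(m)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out |= p[1 + dy:1 + dy + m.shape[0], 1 + dx:1 + dx + m.shape[1]]
        return out
    def erode(m):
        p = np.pad(m, 1, constant_values=1)  # treat out-of-frame as "on"
        out = np.ones_like(m)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out &= p[1 + dy:1 + dy + m.shape[0], 1 + dx:1 + dx + m.shape[1]]
        return out
    return erode(dilate(mask))
```

The closed mask (or a rendering of it) is what gets attached to the prompt, so the model only has to describe the highlighted region instead of comparing two full frames.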

Visual Chain-of-State (VCoS)

Decomposes inference into four steps: State Description, Change Detection, Action Matching, and Consequence Derivation. Matches DAP's validity accuracy (0.879) with greater interpretability.
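The four-step decomposition can be scaffolded as a single structured prompt. The wording of each instruction below is a hypothetical paraphrase, not the paper's exact prompt:

```python
# Hypothetical instruction text for each VCoS step.
VCOS_STEPS = [
    ("State Description", "Describe the salient objects and their positions in each frame."),
    ("Change Detection", "List every visual difference between the pre- and post-action frames."),
    ("Action Matching", "State which listed changes are consistent with the attempted action."),
    ("Consequence Derivation", "Conclude whether the action succeeded, was blocked, or had no effect."),
]

def build_vcos_prompt(action: str) -> str:
    """Assemble the four VCoS steps into one chain-of-state prompt."""
    lines = [f"The agent attempted: {action}. Reason step by step:"]
    for i, (name, instruction) in enumerate(VCOS_STEPS, 1):
        lines.append(f"{i}. {name}: {instruction}")
    return "\n".join(lines)
```

Because each step produces an explicit intermediate answer, errors can be localized to a specific stage (e.g. a missed change vs. a wrong consequence), which is where the interpretability gain comes from.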

Method Comparison: All Metrics

Per-Environment Validity Accuracy

Feedback Gap Recovery Ratio

Per-Outcome Accuracy

Difficulty Analysis

Contrastive Probe Training

Main Results Table

| Method | Validity Acc. | Outcome Acc. | Change Detection F1 | Feedback Gap Ratio |
|---|---|---|---|---|
| Naive Baseline | 0.651 | 0.579 | 0.710 | 0.766 |
| DAP | 0.879 | 0.879 | 0.918 | 1.034 |
| VCoS+DAP | 0.879 | 0.828 | 0.918 | 1.034 |
| Text Feedback | 0.850 | – | – | 1.000 |

Per-Environment Breakdown

| Environment | Total | Success | Blocked | No Effect | Naive | DAP | VCoS+DAP |
|---|---|---|---|---|---|---|---|
| Maze 2D | 200 | 107 | 93 | 0 | 0.580 | 1.000 | 1.000 |
| Sliding Block | 200 | 142 | 58 | 0 | 0.925 | 1.000 | 1.000 |
| Matchstick | 200 | 120 | 0 | 80 | 0.470 | 0.600 | 0.600 |
| Maze 3D | 200 | 191 | 9 | 0 | 0.630 | 0.915 | 0.915 |

Benchmark Statistics