Prompt Strategy Evaluation for Video Spatio-Temporal Pointing

Molmo2-VideoPoint Benchmark | Clark et al., arXiv:2601.10611

Best Baseline F1 (GPT-5): 0.599
Molmo2-7B F1: 0.777
Ablation Improvement: +167%
Best Strategy: Hybrid
Best Format F1: 0.442

[Charts: "F1 Score: Model x Prompt Strategy"; "Component Ablation"; "Output Format Sensitivity"; "Model Performance Range"]

Strategy Comparison Table

Strategy         GPT-5   Gem-3   Gem-2.5  Qwen3   Molmo2
direct point     0.392   0.354   0.271    0.327   0.639
bounding box     0.429   0.396   0.313    0.368   0.673
cot spatial      0.523   0.491   0.405    0.465   0.731
structured json  0.476   0.441   0.354    0.414   0.704
frame indexed    0.447   0.412   0.329    0.384   0.685
hybrid anchor    0.599   0.565   0.480    0.539   0.777
temporal chain   0.504   0.470   0.384    0.444   0.722
multi scale      0.540   0.508   0.420    0.482   0.746
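
The headline figures can be checked against the table with a few lines of Python. The sketch below is only an illustration: the scores are transcribed from the table above, and the names `F1`, `best_strategy`, and `relative_gain` are hypothetical helpers, not part of any benchmark tooling. The gain it reports is relative to the direct point row and is a separate quantity from the "Ablation Improvement" figure.

```python
# F1 scores transcribed from the Strategy Comparison Table.
# Column order follows the table header.
MODELS = ["GPT-5", "Gem-3", "Gem-2.5", "Qwen3", "Molmo2"]
F1 = {
    "direct point":    [0.392, 0.354, 0.271, 0.327, 0.639],
    "bounding box":    [0.429, 0.396, 0.313, 0.368, 0.673],
    "cot spatial":     [0.523, 0.491, 0.405, 0.465, 0.731],
    "structured json": [0.476, 0.441, 0.354, 0.414, 0.704],
    "frame indexed":   [0.447, 0.412, 0.329, 0.384, 0.685],
    "hybrid anchor":   [0.599, 0.565, 0.480, 0.539, 0.777],
    "temporal chain":  [0.504, 0.470, 0.384, 0.444, 0.722],
    "multi scale":     [0.540, 0.508, 0.420, 0.482, 0.746],
}


def best_strategy(model):
    """Return the strategy with the highest F1 for the given model column."""
    col = MODELS.index(model)
    return max(F1, key=lambda strategy: F1[strategy][col])


def relative_gain(model, strategy, baseline="direct point"):
    """Fractional F1 improvement of `strategy` over `baseline` for one model."""
    col = MODELS.index(model)
    return F1[strategy][col] / F1[baseline][col] - 1.0


if __name__ == "__main__":
    for model in MODELS:
        top = best_strategy(model)
        gain = relative_gain(model, top)
        print(f"{model:8s} best={top:14s} gain over direct point={gain:+.1%}")
```

Run as a script, this prints one line per model with its top-scoring strategy and that strategy's relative gain over the direct point row.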