Molmo2-VideoPoint Benchmark (Clark et al., arXiv:2601.10611)

| Strategy | GPT-5 | Gem-3 | Gem-2.5 | Qwen3 | Molmo2 |
|---|---|---|---|---|---|
| direct point | 0.392 | 0.354 | 0.271 | 0.327 | 0.639 |
| bounding box | 0.429 | 0.396 | 0.313 | 0.368 | 0.673 |
| cot spatial | 0.523 | 0.491 | 0.405 | 0.465 | 0.731 |
| structured json | 0.476 | 0.441 | 0.354 | 0.414 | 0.704 |
| frame indexed | 0.447 | 0.412 | 0.329 | 0.384 | 0.685 |
| hybrid anchor | 0.599 | 0.565 | 0.480 | 0.539 | 0.777 |
| temporal chain | 0.504 | 0.470 | 0.384 | 0.444 | 0.722 |
| multi scale | 0.540 | 0.508 | 0.420 | 0.482 | 0.746 |
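
As a rough aid for reading the table, the sketch below copies the scores into a plain dictionary and computes two summaries: the top-scoring strategy for each model and the per-model gain of one strategy over the `direct point` row. It assumes that higher scores are better, which the table itself does not state. The helper names `best_strategy_per_model` and `gain_over_baseline` are hypothetical and are not taken from the paper.

```python
# Scores copied verbatim from the table; keys are the table's own row labels.
# Assumption: higher is better (the metric is not named in the table).
MODELS = ["GPT-5", "Gem-3", "Gem-2.5", "Qwen3", "Molmo2"]
SCORES = {
    "direct point":    [0.392, 0.354, 0.271, 0.327, 0.639],
    "bounding box":    [0.429, 0.396, 0.313, 0.368, 0.673],
    "cot spatial":     [0.523, 0.491, 0.405, 0.465, 0.731],
    "structured json": [0.476, 0.441, 0.354, 0.414, 0.704],
    "frame indexed":   [0.447, 0.412, 0.329, 0.384, 0.685],
    "hybrid anchor":   [0.599, 0.565, 0.480, 0.539, 0.777],
    "temporal chain":  [0.504, 0.470, 0.384, 0.444, 0.722],
    "multi scale":     [0.540, 0.508, 0.420, 0.482, 0.746],
}


def best_strategy_per_model(scores, models):
    """Return {model: strategy with the highest score in that column}."""
    return {
        model: max(scores, key=lambda strategy: scores[strategy][i])
        for i, model in enumerate(models)
    }


def gain_over_baseline(scores, models, strategy, baseline="direct point"):
    """Return {model: score(strategy) - score(baseline)}, rounded to 3 places."""
    return {
        model: round(scores[strategy][i] - scores[baseline][i], 3)
        for i, model in enumerate(models)
    }


if __name__ == "__main__":
    print(best_strategy_per_model(SCORES, MODELS))
    print(gain_over_baseline(SCORES, MODELS, "hybrid anchor"))
```

On these numbers, `hybrid anchor` is the top row in every column, and its gain over `direct point` is smallest for Molmo2 (0.138, against roughly 0.21 for the other four models).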