Which capability matters more for detecting paper-code discrepancies?
| Model | Type | Score |
|---|---|---|
| Claude-3.5-Opus | hybrid | 0.892 |
| GPT-5 | reasoning | 0.890 |
| GPT-5-Turbo | hybrid | 0.882 |
| Claude-3.5-Sonnet | reasoning | 0.871 |
| GPT-5-Mini | reasoning | 0.846 |
| Gemini-Ultra | reasoning | 0.842 |
| GPT-5-Codex | coding | 0.794 |
| DeepSeek-Coder-V3 | coding | 0.769 |
| CodeLlama-70B | coding | 0.702 |
| StarCoder2-15B | coding | 0.612 |