General Reasoning vs Coding Specialization for SciCoQA

Which capability matters more for detecting paper-code discrepancies?

cs.CL | Baumgartner et al. 2026 | arXiv: 2601.12910
Reasoning Corr.: 0.987
Coding Corr.: -0.355
Reasoning Impact: 2.4x
Optimal R/C/I: 60/20/20

Model Rankings

Model               Type        Score
Claude-3.5-Opus     hybrid      0.892
GPT-5               reasoning   0.890
GPT-5-Turbo         hybrid      0.882
Claude-3.5-Sonnet   reasoning   0.871
GPT-5-Mini          reasoning   0.846
Gemini-Ultra        reasoning   0.842
GPT-5-Codex         coding      0.794
DeepSeek-Coder-V3   coding      0.769
CodeLlama-70B       coding      0.702
StarCoder2-15B      coding      0.612
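
The rankings above can be summarized per model type with a few lines of Python. This is only an illustrative sketch, not the authors' analysis: the scores are copied from the table, while the name `RANKINGS` and the helper `mean_score_by_type` are hypothetical and introduced here for the example.

```python
from collections import defaultdict
from statistics import mean

# (model, type, score) rows copied from the Model Rankings table.
RANKINGS = [
    ("Claude-3.5-Opus", "hybrid", 0.892),
    ("GPT-5", "reasoning", 0.890),
    ("GPT-5-Turbo", "hybrid", 0.882),
    ("Claude-3.5-Sonnet", "reasoning", 0.871),
    ("GPT-5-Mini", "reasoning", 0.846),
    ("Gemini-Ultra", "reasoning", 0.842),
    ("GPT-5-Codex", "coding", 0.794),
    ("DeepSeek-Coder-V3", "coding", 0.769),
    ("CodeLlama-70B", "coding", 0.702),
    ("StarCoder2-15B", "coding", 0.612),
]


def mean_score_by_type(rows):
    """Group scores by model type and return the unweighted mean per type.

    Hypothetical helper for illustration; the source does not state how
    its per-type comparison was computed.
    """
    grouped = defaultdict(list)
    for _model, model_type, score in rows:
        grouped[model_type].append(score)
    return {model_type: mean(scores) for model_type, scores in grouped.items()}


if __name__ == "__main__":
    averages = mean_score_by_type(RANKINGS)
    # Print types from highest to lowest mean score.
    for model_type, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
        print(f"{model_type:<10} {avg:.3f}")
```

With the table values, the unweighted means come out to about 0.887 for hybrid, 0.862 for reasoning, and 0.719 for coding models. The hybrid group has only two entries, so its average should be read with caution.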

[Chart: Model Comparison by Type]

[Chart: Capability Ablation (GPT-5-Mini)]

[Chart: Performance by Subtask]

[Chart: Reasoning vs Performance]

[Chart: Coding vs Performance]

[Chart: Optimal Reasoning/Coding Mix]