CharToM-QA: Context Length vs ToM Difficulty

Variance Decomposition

Theory-of-mind order explains nearly 4x more variance than context length.

Higher-order ToM questions show steeper context-length degradation.

Marginal accuracy by context length (left) and ToM order (right).

ToM is primary: ToM order accounts for 75% of the variance in difficulty, far more than context length.
Context is secondary: context length contributes 19%, a significant but not dominant share.
Interaction exists: Longer contexts amplify 2nd-order ToM difficulty specifically.
Robust finding: Consistent across 5 model capability levels.
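
One way to arrive at variance shares like those above is a two-way sum-of-squares decomposition over a balanced grid of mean accuracies (ToM order by context-length bin). The sketch below is a minimal illustration under that assumption, not necessarily the procedure used for CharToM-QA. The function name variance_shares and the example grid are hypothetical, and the accuracy values are made up for illustration; they are not benchmark results.

```python
def variance_shares(acc):
    """Split the variance of a balanced accuracy grid into main effects and interaction.

    acc[i][j] is the mean accuracy for ToM order i and context-length bin j.
    With one value per cell, the interaction term also absorbs residual noise.
    """
    n_rows, n_cols = len(acc), len(acc[0])
    grand = sum(sum(row) for row in acc) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in acc]
    col_means = [sum(acc[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]

    # Sum of squares for each main effect, weighted by cells per level.
    ss_tom = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_ctx = n_rows * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((acc[i][j] - grand) ** 2
                   for i in range(n_rows) for j in range(n_cols))
    if ss_total == 0:
        raise ValueError("accuracy grid has no variance to decompose")

    # Whatever the two main effects leave unexplained is interaction (plus noise).
    ss_inter = ss_total - ss_tom - ss_ctx
    return {
        "tom_order": ss_tom / ss_total,
        "context_length": ss_ctx / ss_total,
        "interaction": ss_inter / ss_total,
    }


# Hypothetical grid: rows are 1st- and 2nd-order ToM, columns are
# short, medium, and long contexts. Values are illustrative only.
example_grid = [
    [0.90, 0.88, 0.85],
    [0.70, 0.62, 0.50],
]
shares = variance_shares(example_grid)
for factor, share in shares.items():
    print(f"{factor}: {share:.1%}")
```

In this toy grid the second row falls faster than the first as context grows, so the interaction share is nonzero; a purely additive grid would give an interaction share of zero.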