Explaining the LLM–Human Gap in Jabberwocky Interpretation

Interactive exploration of why LLMs outperform humans at recovering meaning from Jabberwockified text: text whose content words are replaced with nonce words while grammatical structure is preserved, as in Carroll's "Jabberwocky". Based on Lupyan et al. (2026, arXiv:2601.11432).

Key Findings

LLMs and humans rely on the same morphosyntactic cues in the same priority order (Pearson r up to 0.985), but LLMs integrate them more effectively, exhibiting degradation slopes 38–62% shallower than humans' (see the sketch after the summary statistics).
Human accuracy (all cues): 0.915
GPT-4 accuracy (all cues): 0.981
Maximum cue-sensitivity correlation (Pearson r): 0.985
Human degradation slope: 0.125
GPT-4 degradation slope: 0.077

[Interactive charts: Cue Ablation Profiles · Cumulative Degradation Curves · Gap Decomposition · Performance Gap vs. Complexity]

Cue Sensitivity Correlation (Human vs. LLM)

LLM        Pearson r  p-value  Kendall τ  p-value  Interpretation
GPT-4      0.807      0.052    0.600      0.136    Strong positive
Claude     0.853      0.031    0.467      0.272    Strong positive (sig.)
LLaMA-70B  0.813      0.049    0.200      0.719    Strong positive (sig.)
LLaMA-7B   0.985      <0.001   1.000      0.003    Near-perfect (sig.)
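
Both statistics can be computed from paired cue-sensitivity vectors with scipy.stats. A minimal sketch with hypothetical placeholder sensitivities, one value per cue type; six cues are assumed here, which is consistent with the table's exact Kendall p-values (e.g., τ = 1.0 with n = 6 gives p = 2/720 ≈ 0.003, matching the LLaMA-7B row):

```python
# Pearson r (linear agreement in sensitivity magnitudes) and Kendall tau
# (agreement in cue priority order) between human and LLM ablation profiles.
# The sensitivity values are hypothetical placeholders, not the study's data.
from scipy.stats import kendalltau, pearsonr

# One sensitivity score per morphosyntactic cue, ordered by human priority.
human_sens = [0.31, 0.24, 0.18, 0.12, 0.09, 0.06]
llm_sens = [0.28, 0.22, 0.17, 0.13, 0.08, 0.05]  # same rank order -> tau = 1.0

r, p_r = pearsonr(human_sens, llm_sens)
tau, p_tau = kendalltau(human_sens, llm_sens)
print(f"Pearson r = {r:.3f} (p = {p_r:.3f})")
print(f"Kendall tau = {tau:.3f} (p = {p_tau:.3f})")  # exact p for n = 6: ~0.003
```

Reporting both statistics is informative because they answer different questions: Pearson r asks whether sensitivity magnitudes covary linearly, while Kendall τ asks only whether the cue priority order matches, which is why GPT-4 can show a strong r (0.807) alongside a non-significant τ (0.600, p = 0.136) on so few cues.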

Scaling Analysis