Interactive exploration of why LLMs outperform humans at recovering meaning from Jabberwockified text. Based on Lupyan et al. (2026, arXiv:2601.11432).
| LLM | Pearson r | p (Pearson) | Kendall τ | p (Kendall) | Interpretation |
|---|---|---|---|---|---|
| GPT-4 | 0.807 | 0.052 | 0.600 | 0.136 | Strong positive (n.s.) |
| Claude | 0.853 | 0.031 | 0.467 | 0.272 | Strong positive (sig.) |
| LLaMA-70B | 0.813 | 0.049 | 0.200 | 0.719 | Strong positive (sig.) |
| LLaMA-7B | 0.985 | <0.001 | 1.000 | 0.003 | Near-perfect (sig.) |
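For readers unfamiliar with the two statistics in the table: Pearson r measures linear association between paired scores, while Kendall τ measures rank agreement. The sketch below computes both in pure Python (scipy's `pearsonr` / `kendalltau` are the usual tools) on hypothetical per-item scores; the numbers are illustrative placeholders, not data from the paper.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical (LLM accuracy, human accuracy) pairs per Jabberwockified passage.
llm_scores   = [0.82, 0.61, 0.74, 0.55, 0.90, 0.47]
human_scores = [0.40, 0.28, 0.27, 0.22, 0.46, 0.18]

print(f"Pearson r = {pearson_r(llm_scores, human_scores):.3f}")
print(f"Kendall tau = {kendall_tau(llm_scores, human_scores):.3f}")
```

Note how the two can diverge: τ depends only on the ordering of items, so a single rank swap lowers it even when the scores remain nearly linear, which is why the table's Kendall column is noisier than the Pearson column at these small sample sizes.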