Aggregate Miscalibration: 10.0% (systematic bias)
Kendall's tau (Rankings): 1.0 (perfect preservation)
Distinguishability (RF): 93.1% (easily separable)
Total Interactions: 85,050 (24.3K human, 60.8K simulated)
[Figure: Calibration Gap by Difficulty]
[Figure: Agent Rankings: Human vs Simulated]
[Figure: Calibration by Country (Hard Tasks)]
[Figure: Feature Distinguishability (KS Statistic)]
[Figure: Calibration by Age Group and Difficulty]
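
The Gap and p-value columns of the detailed calibration table could be derived roughly as in the following minimal sketch. It rests on assumptions not stated in the source: that each rate is a proportion of binary outcomes, that the gap is the simulated rate minus the human rate, and that the p-value comes from a two-sided two-proportion z-test. The function name `calibration_gap` and its parameters are hypothetical.

```python
import math


def calibration_gap(human_hits, human_n, sim_hits, sim_n):
    """Return (human_rate, sim_rate, gap, p_value) for one group.

    Assumptions (not confirmed by the source):
    - gap = simulated rate - human rate
    - p_value is from a two-sided two-proportion z-test with a pooled variance
    """
    human_rate = human_hits / human_n
    sim_rate = sim_hits / sim_n
    gap = sim_rate - human_rate

    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (human_hits + sim_hits) / (human_n + sim_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / human_n + 1 / sim_n))
    if se == 0:
        # All outcomes identical in both groups: no evidence of a difference.
        return human_rate, sim_rate, gap, 1.0

    z = gap / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return human_rate, sim_rate, gap, p_value


if __name__ == "__main__":
    # Illustrative counts only, not taken from the dashboard.
    h, s, gap, p = calibration_gap(60, 100, 70, 100)
    print(f"human={h:.3f} simulated={s:.3f} gap={gap:+.3f} p={p:.3f}")
```

With many groups (difficulty, country, age), a multiple-comparison correction would usually be applied to the resulting p-values; whether the source does so is not stated.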
Detailed Calibration Table
| Difficulty | Group | Human Rate | Simulated Rate | Gap | p-value |