Validity of LLM-Simulated Users as Proxies for Real Users

Large-scale study (85,050 interactions) measuring calibration gaps between simulated and real users across difficulty levels, countries, and age groups.
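
As a rough illustration of what a calibration gap for one difficulty/group cell might look like, the sketch below assumes the gap is the simulated success rate minus the human success rate, and that the p-value comes from a two-sided two-proportion z-test. Neither the exact definition nor the test is stated in this summary, so both are assumptions, and the function name `calibration_gap` is hypothetical.

```python
import math


def calibration_gap(human_successes, human_total, sim_successes, sim_total):
    """Return (human_rate, sim_rate, gap, p_value) for one cell.

    Assumption: gap = simulated rate - human rate, tested with a
    two-sided two-proportion z-test (pooled standard error).
    """
    human_rate = human_successes / human_total
    sim_rate = sim_successes / sim_total
    gap = sim_rate - human_rate

    # Pooled success rate under the null hypothesis of equal rates.
    pooled = (human_successes + sim_successes) / (human_total + sim_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / human_total + 1 / sim_total))
    if se == 0:
        # Both groups are all-success or all-failure: no evidence of a gap.
        return human_rate, sim_rate, gap, 1.0

    z = gap / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return human_rate, sim_rate, gap, p_value
```

A positive gap under this convention would mean the simulated users succeed more often than real users on the same tasks.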

Aggregate Miscalibration: 10.0% (systematic bias)
Kendall's tau (Rankings): 1.0 (perfect preservation)
Distinguishability (RF): 93.1% (easily separable)
Total Interactions: 85,050 (24.3K human, 60.8K simulated)
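
The headline figures combine a systematic miscalibration with a Kendall's tau of 1.0, which is possible because tau depends only on the ordering of agents, not on the level of their scores. The sketch below computes tau from pairwise concordance (the tau-a variant, which ignores tie corrections). The agent names and scores are invented for illustration and are not taken from the study.

```python
from itertools import combinations


def kendall_tau(human_scores, sim_scores):
    """Kendall's tau-a between two score dicts keyed by agent name."""
    agents = sorted(human_scores)
    concordant = discordant = 0
    for a, b in combinations(agents, 2):
        # A pair is concordant when both evaluations order it the same way.
        sign = (human_scores[a] - human_scores[b]) * (sim_scores[a] - sim_scores[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(agents) * (len(agents) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical success rates: simulated users inflate every agent's score,
# yet the ordering of agents is unchanged.
human = {"agent_a": 0.72, "agent_b": 0.61, "agent_c": 0.55}
simulated = {"agent_a": 0.83, "agent_b": 0.70, "agent_c": 0.66}

tau = kendall_tau(human, simulated)  # 1.0: identical ordering
```

In this toy case every simulated score is higher than its human counterpart, so the absolute rates are miscalibrated, but an agent comparison based on rank alone would reach the same conclusion from either population.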

Figure: Calibration Gap by Difficulty
Figure: Agent Rankings: Human vs Simulated
Figure: Calibration by Country (Hard Tasks)
Figure: Feature Distinguishability (KS Statistic)
Figure: Calibration by Age Group and Difficulty

Detailed Calibration Table

Difficulty | Group | Human Rate | Simulated Rate | Gap | p-value