Validity of LLM-Simulated Users as Proxies

Aggregate Miscalibration: 10.0% (systematic bias)

Kendall's tau (Rankings): 1.0 (perfect preservation)

Distinguishability (RF): 93.1% (easily separable)

Total Interactions: 85,050 (24.3K human, 60.8K simulated)
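
The summary figures pair a nonzero aggregate gap with a Kendall's tau of 1.0, which can look contradictory at first. The sketch below illustrates how both can hold at once: a uniform upward shift in simulated rates changes every value but leaves the ordering intact. The group names and rates are hypothetical, the gap is assumed to be simulated rate minus human rate, and tau is computed as tau-a (no tie correction); none of this is taken from the underlying dataset.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a for two equal-length sequences (no tie correction)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def calibration_gap(human_rate, simulated_rate):
    """Assumed definition: simulated rate minus human rate."""
    return simulated_rate - human_rate


# Hypothetical per-group rates, for illustration only.
human = {"A": 0.40, "B": 0.55, "C": 0.70}
simulated = {"A": 0.52, "B": 0.66, "C": 0.78}

groups = sorted(human)
tau = kendall_tau([human[g] for g in groups], [simulated[g] for g in groups])
gaps = {g: calibration_gap(human[g], simulated[g]) for g in groups}
mean_gap = sum(gaps.values()) / len(gaps)

print(f"tau = {tau:.2f}, mean gap = {mean_gap:.3f}")
```

In this toy case every group's simulated rate is higher than its human rate, so the mean gap is positive, yet the groups rank in the same order under both sources, so tau is 1.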

Calibration Gap by Difficulty

Difficulty	Group	Human Rate	Simulated Rate	Gap	p-value