Seshadri et al. (2026) demonstrated that LLM-simulated users are unreliable proxies for real human users in agentic evaluations. However, their study fixed the agent to GPT-4o. We investigate: Does the calibration gap depend on the agent's capability level?
| Agent Model | Capability | Calibration Gap | Disparity (Sim) | Disparity (Real) | Cross-Disparity | Success Rate (Sim) | Success Rate (Real) |
|---|---|---|---|---|---|---|---|
The calibration gap measures the absolute difference between success rates with simulated vs. real users. Our key finding is that this gap decreases significantly with agent capability.
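In code, the metric is just an absolute difference between the two success rates. A minimal sketch (the success-rate values below are hypothetical, for illustration only):

```python
def calibration_gap(sr_sim: float, sr_real: float) -> float:
    """Absolute difference between simulated- and real-user success rates."""
    return abs(sr_sim - sr_real)

# Hypothetical success rates for a single agent evaluation.
print(round(calibration_gap(0.72, 0.63), 3))  # 0.09
```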
Groups with lower baseline clarity and proficiency exhibit higher calibration gaps, but all groups benefit from increased agent capability.
Disparities measure the gap between the best- and worst-performing demographic groups. We compare disparity patterns for simulated vs. real users.
The cross-disparity gap measures how well simulated evaluations capture real disparity patterns. Despite decreasing calibration gaps, this metric does not improve monotonically.
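A sketch of these two metrics, assuming disparity is the max-minus-min group success rate and using the absolute difference between simulated and real disparities as a simplified stand-in for the cross-disparity gap (the study's exact definition may differ):

```python
def disparity(group_sr: dict[str, float]) -> float:
    """Gap between the best- and worst-performing demographic groups."""
    return max(group_sr.values()) - min(group_sr.values())

def cross_disparity_gap(sim_sr: dict[str, float], real_sr: dict[str, float]) -> float:
    """Simplified proxy: how far the simulated disparity is from the real one."""
    return abs(disparity(sim_sr) - disparity(real_sr))

# Hypothetical per-group success rates for illustration.
sim = {"group_a": 0.80, "group_b": 0.74, "group_c": 0.78}
real = {"group_a": 0.70, "group_b": 0.58, "group_c": 0.66}
print(round(disparity(sim), 3))               # 0.06
print(round(cross_disparity_gap(sim, real), 3))  # 0.06
```

Note that a small cross-disparity gap under this simplification can still hide per-group disagreements, which is one reason the metric need not improve even as calibration does.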
Adjust the capability slider to see how agent sub-capabilities and predicted outcomes change.
The three sub-capabilities scale differently: instruction following is linear, error recovery is sigmoidal (emerging around mid-capability), and accommodation is quadratic (late emergence).
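The three curves can be sketched as functions of capability theta in [0, 1]; the sigmoid midpoint and steepness below are illustrative assumptions, not fitted values from the study:

```python
import math

def instruction_following(theta: float) -> float:
    """Linear scaling with capability theta."""
    return theta

def error_recovery(theta: float, midpoint: float = 0.5, steepness: float = 10.0) -> float:
    """Sigmoidal scaling: emerges around mid-capability (midpoint/steepness assumed)."""
    return 1.0 / (1.0 + math.exp(-steepness * (theta - midpoint)))

def accommodation(theta: float) -> float:
    """Quadratic scaling: stays small until theta is high (late emergence)."""
    return theta ** 2
```

The quadratic curve is what makes accommodation a frontier-regime phenomenon: at theta = 0.5 it is only 0.25, but it more than triples by theta = 0.9.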
**Significant.** Spearman rho = -0.90, p < 0.001. The calibration gap drops from 0.095 (phi-3-mini) to 0.048 (frontier-2026), a reduction of approximately 50%.
**Trend.** Real-user disparity shows a negative but not statistically significant trend (rho = -0.56, p = 0.116); simulated-user disparity decreases more clearly (rho = -0.70, p = 0.036).
**Warning.** The ability of simulated evaluations to capture the correct disparity pattern does not systematically improve with capability (rho = +0.22, p = 0.576): fairness audits using simulated users remain unreliable regardless of agent capability.
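The rank correlations above can be reproduced with a small helper. This is the standard no-ties Spearman formula; in the study the inputs would be per-model capability and metric values, illustrated here with toy numbers:

```python
def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation (no ties): 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# A perfectly monotone decreasing relationship gives rho = -1.0.
print(spearman_rho([1, 2, 3, 4], [0.9, 0.7, 0.5, 0.1]))  # -1.0
```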
Changepoint analysis reveals an accelerated improvement in calibration gap above theta = 0.85 (right-segment slope = -0.332 vs. left-segment slope = -0.034). This suggests qualitative changes in frontier-regime models.
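A sketch of the two-segment fit, assuming a fixed breakpoint at theta = 0.85 and ordinary least-squares slopes on each side (the study's changepoint method may differ; the data below are synthetic):

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def segment_slopes(theta, gap, breakpoint=0.85):
    """Slopes of separate lines fitted left and right of the breakpoint."""
    left = [(t, g) for t, g in zip(theta, gap) if t <= breakpoint]
    right = [(t, g) for t, g in zip(theta, gap) if t > breakpoint]
    lx, ly = zip(*left)
    rx, ry = zip(*right)
    return ols_slope(lx, ly), ols_slope(rx, ry)

# Synthetic calibration-gap curve: shallow decline, then a steep drop past 0.85.
theta = [0.3, 0.5, 0.7, 0.9, 0.95, 1.0]
gap = [0.100, 0.090, 0.080, 0.060, 0.045, 0.030]
print(segment_slopes(theta, gap))
```

With this synthetic data the right-segment slope is six times steeper than the left, mirroring the qualitative shape the changepoint analysis reports.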
The quadratic scaling of accommodation (theta^2) drives the calibration gap reduction. More capable agents adapt to diverse communication styles, partially compensating for the gap between idealized simulated and noisy real users.
| Model | Capability | Instruction Following | Error Recovery | Accommodation |
|---|---|---|---|---|
| Group | Clarity | Tolerance | Proficiency |
|---|---|---|---|
Based on the open problem from Seshadri et al. (2026), "Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations" (arXiv: 2601.17087). The study acknowledged: "We cannot assess whether these issues vary across agents of different capabilities."