Capability-Indexed Calibration Analysis

How Agent Model Capability Modulates Calibration Gaps and Demographic Disparities in Agentic Evaluations
43,200 interaction trials: 9 agents × 8 demographic groups × 2 user types

Research Overview

Seshadri et al. (2026) demonstrated that LLM-simulated users are unreliable proxies for real human users in agentic evaluations. However, their study fixed the agent to GPT-4o. We investigate: Does the calibration gap depend on the agent's capability level?

Headline results:
- Calibration gap vs. capability: significant negative correlation (Spearman rho = -0.90, p < 0.001)
- ~50% reduction in calibration gap from the weakest to the strongest agent
- Cross-disparity gap vs. capability: not significant (rho = +0.22, p = 0.576)
- Capability breakpoint at theta = 0.85, above which calibration improvement accelerates

Summary Table: Metrics Across the Capability Spectrum
[Table columns: Agent Model, Capability, Calibration Gap, Disparity (Sim), Disparity (Real), Cross-Disparity, Success Rate (Sim), Success Rate (Real)]

Calibration Gap Analysis

The calibration gap measures the absolute difference between success rates with simulated vs. real users. Our key finding is that this gap decreases significantly with agent capability.
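
As a concrete reference, here is a minimal sketch of the metric, assuming each trial is a dict with `agent`, `user_type` ("sim" or "real"), and `success` fields; this layout is our illustration, not the study's actual pipeline.

```python
def success_rate(trials, agent, user_type):
    """Fraction of successful trials in one (agent, user_type) cell."""
    outcomes = [t["success"] for t in trials
                if t["agent"] == agent and t["user_type"] == user_type]
    return sum(outcomes) / len(outcomes)


def calibration_gap(trials, agent):
    """Absolute difference between simulated- and real-user success rates."""
    return abs(success_rate(trials, agent, "sim")
               - success_rate(trials, agent, "real"))
```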

[Figure panels: Calibration Gap vs. Capability · Success Rates by User Type · Per-Group Calibration Gaps]

Groups with lower baseline clarity and proficiency exhibit higher calibration gaps, but all groups benefit from increased agent capability.

Demographic Disparity Analysis

Disparities measure the gap between the best- and worst-performing demographic groups. We compare disparity patterns for simulated vs. real users.
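
Under the same assumed record layout, with a hypothetical `group` field added, the disparity for one agent and user type is the best-minus-worst spread across groups (reusing `success_rate` from the sketch above).

```python
def disparity(trials, agent, user_type, groups):
    """Gap between the best- and worst-performing demographic groups."""
    rates = [success_rate([t for t in trials if t["group"] == g],
                          agent, user_type)
             for g in groups]
    return max(rates) - min(rates)
```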

[Figure panels: Disparity vs. Capability · Equalized Odds Difference · Cross-Disparity Gap]

The cross-disparity gap measures how well simulated evaluations capture real disparity patterns. Despite decreasing calibration gaps, this metric does not improve monotonically.
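
One simple operationalization, consistent with the definitions above, compares only the disparity magnitudes; the study's metric may be richer (e.g., built on equalized-odds differences), so treat this as a sketch.

```python
def cross_disparity_gap(trials, agent, groups):
    """How far the simulated evaluation's disparity is from the real one."""
    return abs(disparity(trials, agent, "sim", groups)
               - disparity(trials, agent, "real", groups))
```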

Interactive Capability Explorer

Adjust the capability slider to see how agent sub-capabilities and predicted outcomes change.

[Interactive widget: capability slider from 0.10 (Weak) through 0.50 (Mid) to 0.99 (Frontier); secondary control from 0.00 (None) through 0.20 (Default) to 0.40 (High)]
Example readout: Instruction Following 0.77, Error Recovery 0.93, Accommodation 0.52, Est. Calibration Gap 0.088.
Sub-Capability Scaling

The three sub-capabilities scale differently: instruction following is linear, error recovery is sigmoidal (emerging around mid-capability), and accommodation is quadratic (late emergence).
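
A minimal sketch of the three scaling laws as stated; only the functional forms (linear, sigmoidal, quadratic) come from the analysis, while the slope, sigmoid midpoint, and steepness below are illustrative assumptions.

```python
import math

def instruction_following(theta):
    # Linear in capability (unit slope assumed for illustration).
    return theta

def error_recovery(theta, midpoint=0.5, steepness=10.0):
    # Sigmoidal: emerges around mid-capability (midpoint/steepness assumed).
    return 1.0 / (1.0 + math.exp(-steepness * (theta - midpoint)))

def accommodation(theta):
    # Quadratic (theta ** 2): late emergence relative to the linear curve.
    return theta ** 2
```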

Key Findings

1. Calibration gap decreases with capability. [Significant]

Spearman rho = -0.90, p < 0.001. The calibration gap drops from 0.095 (phi-3-mini) to 0.048 (frontier-2026), a reduction of approximately 50%.

2. Demographic disparities decrease, but weakly. [Trend]

Real-user disparity shows a negative but not statistically significant trend (rho = -0.56, p = 0.116); simulated-user disparity decreases more clearly (rho = -0.70, p = 0.036).

3. Cross-disparity gap does NOT improve monotonically. [Warning]

The ability of simulated evaluations to capture the correct disparity pattern does not systematically improve with capability (rho = +0.22, p = 0.576). Fairness audits built on simulated users therefore remain unreliable regardless of agent capability.

4. Phase transition at capability 0.85.

Changepoint analysis reveals accelerated improvement in the calibration gap above theta = 0.85 (right-segment slope = -0.332 vs. left-segment slope = -0.034; a fitting sketch follows this list). This suggests a qualitative change in the frontier regime.

5. Accommodation is the key mechanism.

The quadratic scaling of accommodation (theta^2) drives the calibration-gap reduction: more capable agents adapt to diverse communication styles, partially compensating for the mismatch between idealized simulated users and noisy real ones.
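
Findings 1 and 4 can be checked with standard tools: a Spearman rank correlation over the nine (capability, gap) pairs, and a two-segment least-squares fit with the breakpoint fixed at theta = 0.85. The study's actual changepoint procedure may differ (the breakpoint search itself is omitted here); this is a sketch.

```python
import numpy as np
from scipy.stats import spearmanr

def capability_trend(theta, gap):
    """Spearman rank correlation of calibration gap against capability."""
    rho, p = spearmanr(theta, gap)
    return rho, p

def segment_slopes(theta, gap, breakpoint=0.85):
    """Least-squares slopes below and above a fixed capability breakpoint."""
    theta, gap = np.asarray(theta), np.asarray(gap)
    left = theta <= breakpoint
    left_slope = np.polyfit(theta[left], gap[left], 1)[0]
    right_slope = np.polyfit(theta[~left], gap[~left], 1)[0]
    return left_slope, right_slope
```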

Methods Summary

Agent Models (9)
[Table columns: Model, Capability, Instruction Following, Error Recovery, Accommodation]

Demographic Groups (8)
[Table columns: Group, Clarity, Tolerance, Proficiency]
Reference

Based on the open problem from Seshadri et al. (2026), "Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations" (arXiv:2601.17087). The study acknowledged: "We cannot assess whether these issues vary across agents of different capabilities."