Impact of User Communication Diversity on Agent Performance

A framework for quantifying how variation in user communication styles -- formality, verbosity, politeness, dialect, cultural context, and domain expertise -- affects LLM agent task success.

Simulated Dialogues: 19,200
Max CDSI (Sensitivity): 0.608
Max Style Contribution: 7.5%
All Effects Significant

Key Findings

1. Communication diversity has a measurable and significant impact on task success. CDSI scores range from 0.259 (robust) to 0.608 (highly sensitive).
2. Dialect distance and cultural context are the most impactful dimensions, with Spearman correlations reaching rho = -0.37.
3. Style accounts for 1.5%--7.5% of task outcome uncertainty -- exceeding the contribution of task domain by 2--10x.
4. Agents show systematic overconfidence for non-standard communicators, with calibration gaps reaching 0.80.
5. Task success declines smoothly with style distance, following an exponential decay model.

Communication Style Explorer

Predicted task success follows an exponential decay in weighted style distance: P(correct) = alpha * exp(-beta * weighted_distance). At the baseline style (style distance 0.000, weighted distance 0.000), predicted task success is 0.920.
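
A minimal sketch of the decay model, assuming the weighted distance is a weighted Euclidean distance from the baseline style vector. The value alpha = 0.92 matches the baseline prediction of 0.920; beta, the axis weights, the 0-to-1 axis scale, and the function names are illustrative assumptions rather than values from the framework.

```python
import math

# The six style axes named in the framework.
AXES = ("formality", "verbosity", "politeness",
        "dialect_distance", "cultural_context", "domain_expertise")

def weighted_distance(style, baseline, weights):
    """Weighted Euclidean distance between two style vectors (assumed form of d_w)."""
    return math.sqrt(sum(w * (s - b) ** 2
                         for s, b, w in zip(style, baseline, weights)))

def predicted_success(style, baseline, weights, alpha=0.92, beta=1.0):
    """P(correct) = alpha * exp(-beta * weighted_distance)."""
    return alpha * math.exp(-beta * weighted_distance(style, baseline, weights))

# Hypothetical values: axes scaled to [0, 1], equal weights, baseline at 0.5.
baseline = [0.5] * len(AXES)
weights = [1.0] * len(AXES)
shifted = [0.5, 0.5, 0.5, 1.0, 0.5, 0.5]  # only dialect distance moves

print(predicted_success(baseline, baseline, weights))  # 0.92 at zero distance
print(predicted_success(shifted, baseline, weights))   # lower than the baseline
```

Raising a weight on one axis makes the same shift on that axis cost more, which is one way the four agent configurations could differ.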

Task Success Rate by User Profile

Communication Diversity Sensitivity Index (CDSI)

CDSI(agent) = 1 - E[SR(s)] / SR(s_0), where SR is the task success rate, s ranges over non-standard styles, and s_0 is the baseline style.
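
As an illustration, the CDSI formula can be computed from per-profile success rates. The sketch below assumes the expectation is an unweighted mean over non-standard profiles; the profile names and rates are made up for the example.

```python
from statistics import mean

def cdsi(success_by_style, baseline_key="baseline"):
    """CDSI = 1 - mean(SR over non-standard styles) / SR(baseline)."""
    sr_baseline = success_by_style[baseline_key]
    if sr_baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    non_standard = [sr for key, sr in success_by_style.items()
                    if key != baseline_key]
    return 1.0 - mean(non_standard) / sr_baseline

# Hypothetical success rates, not results from the study.
rates = {"baseline": 0.90, "profile_a": 0.60, "profile_b": 0.30}
print(cdsi(rates))  # roughly 0.5: non-standard users succeed half as often
```

A CDSI of 0 means non-standard styles perform as well as the baseline on average; values closer to 1 mean larger relative degradation.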

Per-Axis Sensitivity Analysis

Spearman correlation (rho) between each style dimension and task success. Negative values indicate that higher values on that axis degrade performance.
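
For readers who want to reproduce a per-axis coefficient, here is a dependency-free sketch of Spearman's rho, computed as the Pearson correlation of average ranks. The sample data are invented and the function names are illustrative.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical dialogues: dialect distance per dialogue and binary task success.
dialect_distance = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
success = [1, 1, 1, 0, 1, 0]
print(spearman_rho(dialect_distance, success))  # negative: success falls with distance
```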

Information-Theoretic Decomposition

Interpretation

We decompose the total entropy of task success H(S) into three components:

I(S; Content): 0.006--0.009 bits
I(S; Style | Content): 0.015--0.072 bits
H(S | Content, Style): 0.845--0.971 bits

Communication style explains 2--10x more of the task outcome uncertainty than task domain alone. The style contribution ratio ranges from 1.55% to 7.54%.
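
The three components sum to H(S) by the chain rule for mutual information. The sketch below estimates them from empirical counts over (content, style, success) records. It assumes plug-in entropy estimates and takes the style contribution ratio to be I(S; Style | Content) / H(S), which is an assumption about how that ratio is defined; the records are a toy example.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

def decompose(records):
    """Split H(S) into content, style-given-content, and residual parts.

    records: iterable of (content, style, success) tuples.
    """
    records = list(records)
    h_s = entropy(Counter(s for _, _, s in records))
    h_c = entropy(Counter(c for c, _, _ in records))
    h_cs = entropy(Counter((c, s) for c, _, s in records))
    h_ct = entropy(Counter((c, t) for c, t, _ in records))
    h_cts = entropy(Counter(records))
    h_s_given_c = h_cs - h_c      # H(S | Content)
    h_s_given_ct = h_cts - h_ct   # H(S | Content, Style)
    return {
        "H(S)": h_s,
        "I(S;Content)": h_s - h_s_given_c,
        "I(S;Style|Content)": h_s_given_c - h_s_given_ct,
        "H(S|Content,Style)": h_s_given_ct,
    }

# Toy data: success depends only on style, never on the task domain.
records = ([("booking", "standard", 1)] * 4 + [("booking", "dialect", 0)] * 4
           + [("refund", "standard", 1)] * 4 + [("refund", "dialect", 0)] * 4)
parts = decompose(records)
print(parts)
print(parts["I(S;Style|Content)"] / parts["H(S)"])  # assumed style contribution ratio
```

In this toy case all of the outcome entropy is attributed to style; in the reported results most of it remains in the residual term.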

Equity Analysis: Calibration Gaps by Group

Agents maintain ~90% confidence regardless of user style, but actual success varies from 9.8% to 80.5%. The calibration gap reaches 0.80 for L2 speakers under the Dialect Vulnerable agent.
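
The reported figures are consistent with defining the calibration gap as mean stated confidence minus observed success rate (0.90 - 0.098 is about 0.80). A minimal sketch under that assumed definition, with synthetic data chosen to match those two numbers:

```python
def calibration_gap(confidences, outcomes):
    """Mean stated confidence minus observed success rate for one user group."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    mean_confidence = sum(confidences) / len(confidences)
    success_rate = sum(outcomes) / len(outcomes)
    return mean_confidence - success_rate

# Synthetic group: the agent reports 0.90 confidence on every dialogue,
# but only 98 of 1000 dialogues succeed (a 9.8% success rate).
confidences = [0.90] * 1000
outcomes = [1] * 98 + [0] * 902
print(calibration_gap(confidences, outcomes))  # about 0.80
```

A positive gap indicates overconfidence; a well-calibrated agent would have a gap near zero for every group.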

Statistical Significance

Agent           Chi-squared   p-value     Sig.   Kruskal-Wallis H   p-value     Sig.
Low Sens.       87.6          7.08e-19    ***    140.0              1.60e-24    ***
Mod. Sens.      271.7         1.36e-58    ***    374.8              1.37e-73    ***
High Sens.      462.6         6.15e-100   ***    566.4              2.18e-114   ***
Dialect Vuln.   457.3         8.48e-99    ***    931.2              1.21e-192   ***
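
To show what the chi-squared column measures, here is a sketch of Pearson's chi-squared statistic for a contingency table of style profile against outcome. The counts are hypothetical, and the code assumes every row and column total is nonzero.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table.

    table: list of rows, e.g. one row per style profile with
    [successes, failures] counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if style and outcome were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical counts: [successes, failures] for two style profiles.
table = [[80, 20],
         [50, 50]]
stat, dof = chi_squared(table)
print(stat, dof)
```

If SciPy is available, scipy.stats.chi2.sf(stat, dof) converts the statistic into a p-value.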

Methodology Overview

1. Style Space

6-dimensional parameterization grounded in sociolinguistic theory: formality, verbosity, politeness, dialect distance, cultural context, domain expertise.

2. User Profiles

12 canonical profiles spanning the style space, from baseline to AAVE, L2, high-context, elderly, teen, expert, and corporate communicators.

3. Agent Model

Exponential decay: P(correct|s) = alpha * exp(-beta * d_w(s)). Four agent configurations varying in sensitivity and axis weighting.

4. CDSI Metric

CDSI = 1 - mean(SR_non-standard) / SR_baseline. Scalar summary of agent robustness with per-axis and equity decomposition.

5. Info Decomposition

Mutual information analysis separating content (what) from style (how) contributions to task outcome uncertainty.

6. Statistical Tests

Chi-squared test (style vs. success) and Kruskal-Wallis H test (demographic group differences). All effects are significant at p < 10^-18.
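
For the group comparison, the Kruskal-Wallis H statistic can be computed from pooled ranks. The sketch below uses the standard tie correction; the per-group scores are invented and the function name is illustrative.

```python
from collections import Counter

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction.

    groups: list of lists of numeric observations, one list per group.
    Raises ZeroDivisionError if every observation is identical.
    """
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Map each distinct value to its average 1-based rank in the pooled sample.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + j) / 2 + 1
        i = j + 1
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    ties = Counter(pooled)
    correction = 1.0 - sum(t ** 3 - t for t in ties.values()) / (n ** 3 - n)
    return h / correction

# Hypothetical per-group scores with no ties.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(kruskal_wallis_h(groups))  # about 7.2
```

Under the null hypothesis H is approximately chi-squared distributed with (number of groups - 1) degrees of freedom, which is how a p-value is usually obtained.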
