Evaluating Business-Policy Adherence of Customer Support LLM Agents

Standardized benchmarks for measuring SOP compliance in LLM-based customer support

cs.CL | Balaji et al. 2026 | arXiv: 2601.00596

Best UJCS (Claude-3.5): 0.829
Hybrid Eval F1: 0.907
Agents Tested: 5
Max SOP Steps: 20

Key Findings

Agent          UJCS   Adherence  Step Compl.  Depend.
Claude-3.5     0.829  0.847      0.870        0.790
GPT-4o         0.793  0.810      0.836        0.754
Gemini-Pro     0.759  0.776      0.801        0.720
Mistral-Large  0.715  0.731      0.758        0.677
Llama-70B      0.679  0.695      0.720        0.639

[Figure: Agent Comparison (5-Step SOP)]

[Figure: UJCS vs SOP Complexity]

[Figure: Robustness Under Disturbances]

[Figure: Evaluation Methodology Comparison]

[Figure: Multi-Turn Adherence]