A multi-agent simulation study measuring decision accuracy, score calibration, defect detection, and adversarial robustness of LLM reviewer agent panels.
Multi-agent aggregation of LLM reviewer agents achieves strong performance on standard review tasks. A panel of 5 diverse agents with confidence-weighted aggregation correctly classifies 95% of manuscripts and detects 97.4% of planted defects with zero false positives. However, adversarial prompt injection inflates scores by +0.90 points (on a 10-point scale) and flips 16.7% of decisions, posing a critical barrier to unsupervised deployment.
| Agent Profile | Accuracy | Cohen's Kappa | Calibration (r) | Defect F1 | Assessment |
|---|---|---|---|---|---|
| Accurate Generalist | 0.938 | 0.906 | 0.955 | 0.746 | Balanced |
| Methods-Focused | 0.956 | 0.934 | 0.968 | 0.813 | Best Individual |
| Novelty-Focused | 0.938 | 0.906 | 0.937 | 0.587 | Weak Detection |
| Harsh Reviewer | 0.750 | 0.628 | 0.929 | 0.797 | Biased Low |
| Lenient Reviewer | 0.831 | 0.748 | 0.928 | 0.602 | Sycophancy |
Revise-class manuscripts are hardest to classify, as they occupy the boundary between accept and reject quality tiers.
All dimensions achieve r > 0.98 after multi-agent aggregation. High calibration results from averaging five diverse agents, which cancels systematic biases. Mean r = 0.988.
| Perturbation | Score Shift | Std Dev | Flip Rate | Severity |
|---|---|---|---|---|
| Surface Polish | +0.317 | 0.411 | 6.7% | Moderate |
| Claim Inflation | +0.296 | 0.392 | 6.7% | Moderate |
| Citation Gaming | +0.141 | 0.428 | 8.3% | Low |
| Method Obfuscation | +0.212 | 0.510 | 11.7% | Moderate |
| Adversarial Prompt | +0.901 | 0.387 | 16.7% | Critical |
Method Obfuscation causes the largest drop in defect detection F1 (from 0.698 to 0.642), confirming that obfuscation specifically masks methodological flaws.
| Panel Size | Accuracy | Kappa | Calibration | Defect F1 | Defect Recall | IAA |
|---|---|---|---|---|---|---|
| 1 | 0.950 | 0.925 | 0.959 | 0.772 | 0.629 | |
| 3 | 0.981 | 0.972 | 0.983 | 0.955 | 0.914 | 0.875 |
| 5 | 0.969 | 0.953 | 0.988 | 0.987 | 0.974 | 0.725 |
| 7 | 0.975 | 0.962 | 0.990 | 0.996 | 0.993 | 0.757 |
| 9 | 0.963 | 0.944 | 0.993 | 1.000 | 1.000 | 0.783 |
Individual agents vary substantially (accuracy 75.0%-95.6%; defect F1 0.587-0.813). Multi-agent aggregation with diverse profiles achieves 95.0% decision accuracy and 0.987 defect F1, demonstrating that ensemble review is the most viable deployment mode.
Union aggregation achieves near-perfect recall (0.974) with perfect precision (1.000). LLM reviewer panels are most valuable as defect screening tools that surface potential issues for human assessment.
Adversarial prompt injection causes +0.90 score inflation and 16.7% decision flip rate, confirming the signal-collapse concern. Input sanitization, adversarial training, and review provenance verification are needed before deployment.
Harsh reviewers have high defect sensitivity but poor accuracy; lenient reviewers (sycophancy) have the opposite profile. Careful panel composition with diverse bias profiles improves overall robustness.
3-5 agents provide the best cost-efficacy trade-off for decision accuracy. Defect detection continues improving up to 9 agents. Resource-constrained deployments should use diverse 3-agent panels; high-stakes reviews warrant larger panels.
LLM reviewer agents should be deployed as ensemble screening assistants with human oversight. A panel of 3-5 diverse agents provides structured pre-reviews (defect alerts, dimension scores, confidence flags) that augment human judgment rather than replace it. Adversarial hardening and input sanitization are prerequisites for any editorial pipeline integration.