Evaluating the Efficacy of LLM-Based Reviewer Agents in Scientific Peer Review

A multi-agent simulation study measuring decision accuracy, score calibration, defect detection, and adversarial robustness of LLM reviewer agent panels.

160 Manuscripts · 267 Planted Defects · 5 Agent Profiles · 5 Attack Types

1 Efficacy Summary

Decision Accuracy: 95.0% (meta-reviewer, 5 agents)
Cohen's Kappa: 0.925 (near-perfect agreement)
Score Calibration: 0.988 (mean Pearson r)
Defect F1: 0.987 (union aggregation)
Inter-Agent Kappa: 0.724 (mean pairwise)

Multi-agent aggregation of LLM reviewer agents achieves strong performance on standard review tasks. A panel of 5 diverse agents with confidence-weighted aggregation correctly classifies 95% of manuscripts and detects 97.4% of planted defects with zero false positives. However, adversarial prompt injection inflates scores by +0.90 points (on a 10-point scale) and flips 16.7% of decisions, posing a critical barrier to unsupervised deployment.


2 Individual Agent Performance

Performance Comparison (160 manuscripts, 267 defects)

Agent Profile       | Accuracy | Cohen's Kappa | Calibration (r) | Defect F1 | Assessment
Accurate Generalist | 0.938    | 0.906         | 0.955           | 0.746     | Balanced
Methods-Focused     | 0.956    | 0.934         | 0.968           | 0.813     | Best Individual
Novelty-Focused     | 0.938    | 0.906         | 0.937           | 0.587     | Weak Detection
Harsh Reviewer      | 0.750    | 0.628         | 0.929           | 0.797     | Biased Low
Lenient Reviewer    | 0.831    | 0.748         | 0.928           | 0.602     | Sycophancy
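
For concreteness, a minimal Python sketch of how the table's four metrics could be computed for a single agent, assuming each review carries a decision label, an overall score, and a set of flagged defect IDs (all names below are illustrative, not the study's code):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score
from scipy.stats import pearsonr

def agent_metrics(agent_reviews, ground_truth):
    """Accuracy, Cohen's kappa, calibration r, and defect F1 for one agent."""
    pred = [r["decision"] for r in agent_reviews]  # e.g. "accept"/"revise"/"reject"
    true = [g["decision"] for g in ground_truth]
    acc = accuracy_score(true, pred)
    kappa = cohen_kappa_score(true, pred)

    # Calibration: Pearson r between the agent's overall scores and ground truth.
    r, _ = pearsonr([r["score"] for r in agent_reviews],
                    [g["score"] for g in ground_truth])

    # Defect F1 over planted defects, matched by defect ID.
    tp = fp = fn = 0
    for rev, g in zip(agent_reviews, ground_truth):
        found, planted = set(rev["defects"]), set(g["defects"])
        tp += len(found & planted)
        fp += len(found - planted)
        fn += len(planted - found)
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"accuracy": acc, "kappa": kappa, "calibration_r": r, "defect_f1": f1}
```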



3 Meta-Reviewer Aggregation

Best Individual: 95.6% (Methods-Focused)
Majority Vote: 95.0% (5-agent panel)
Confidence-Weighted: 95.0% (5-agent panel)
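
Both aggregation rules admit a short sketch, assuming each agent reports a decision plus a self-reported confidence in [0, 1]; the study's exact weighting scheme is not specified here, so this is one plausible reading:

```python
from collections import Counter

def majority_vote(decisions):
    """Plain majority over agent decisions; ties break toward the first-seen label."""
    return Counter(decisions).most_common(1)[0][0]

def confidence_weighted(decisions, confidences):
    """Sum each agent's confidence into its chosen label; highest total wins."""
    weights = {}
    for d, c in zip(decisions, confidences):
        weights[d] = weights.get(d, 0.0) + c
    return max(weights, key=weights.get)

# Example: a 3-2 split where the minority is more confident.
panel = ["accept", "accept", "accept", "revise", "revise"]
conf = [0.55, 0.60, 0.50, 0.95, 0.90]
assert majority_vote(panel) == "accept"
assert confidence_weighted(panel, conf) == "revise"  # 1.85 vs. 1.65
```

The example shows the one case where the two rules can disagree: a confident minority outweighing a lukewarm majority.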

Per-Class Accuracy (Meta-Reviewer)

Revise-class manuscripts are hardest to classify, as they occupy the boundary between accept and reject quality tiers.


4 Score Calibration by Dimension

Pearson Correlation (r) Between Agent and Ground-Truth Scores

All dimensions achieve r > 0.98 after multi-agent aggregation. High calibration results from averaging five diverse agents, which cancels systematic biases. Mean r = 0.988.
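
A sketch of the per-dimension measurement, assuming `panel_scores[dim]` holds an (n_agents, n_manuscripts) score array (names are illustrative): the mean is taken across agents before correlating, which is where the bias cancellation enters.

```python
import numpy as np
from scipy.stats import pearsonr

def dimension_calibration(panel_scores, truth):
    """Return {dimension: Pearson r of the panel-mean score vs. ground truth}."""
    return {dim: pearsonr(np.asarray(s).mean(axis=0), truth[dim])[0]
            for dim, s in panel_scores.items()}
```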


5 Defect Detection

Precision: 1.000 (zero false positives)
Recall: 0.974 (union aggregation)
F1 Score: 0.987 (near-perfect detection)

Detection Recall by Defect Type
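
Union aggregation has a direct reading: a defect counts as detected if any panelist flags it. A minimal sketch, assuming defects are matched to the planted set by identifier (names are illustrative):

```python
def union_defect_metrics(panel_defects, planted):
    """panel_defects: per-agent sets of flagged defect IDs for the corpus;
    planted: set of planted defect IDs. Returns (precision, recall, f1)."""
    found = set().union(*panel_defects)  # flagged by at least one agent
    tp = len(found & planted)
    fp = len(found - planted)
    fn = len(planted - found)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```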


6 Adversarial Robustness

Score Inflation and Decision Flip Rate by Perturbation Type

Perturbation       | Score Shift | Std Dev | Flip Rate | Severity
Surface Polish     | +0.317      | 0.411   | 6.7%      | Moderate
Claim Inflation    | +0.296      | 0.392   | 6.7%      | Moderate
Citation Gaming    | +0.141      | 0.428   | 8.3%      | Low
Method Obfuscation | +0.212      | 0.510   | 11.7%     | Moderate
Adversarial Prompt | +0.901      | 0.387   | 16.7%     | Critical

Defect Detection F1 Under Adversarial Perturbation

Method Obfuscation causes the largest drop in defect detection F1 (from 0.698 to 0.642), confirming that obfuscation specifically masks methodological flaws.
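
The table's score shift and flip rate could be computed from paired clean/perturbed reviews of each manuscript, along these lines (a sketch; names are illustrative):

```python
import statistics

def robustness(clean, perturbed):
    """clean/perturbed: parallel lists of (score, decision) per manuscript.
    Returns mean score shift, its std dev, and the decision flip rate."""
    shifts = [p[0] - c[0] for c, p in zip(clean, perturbed)]
    flips = sum(c[1] != p[1] for c, p in zip(clean, perturbed))
    return statistics.mean(shifts), statistics.stdev(shifts), flips / len(clean)
```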


7 Panel Size Ablation

Metrics vs. Number of Reviewer Agents

5-agent panel:
Accuracy: 0.969
Cohen's Kappa: 0.953
Calibration (r): 0.988
Defect F1: 0.987
Panel Size | Accuracy | Kappa | Calibration (r) | Defect F1 | Defect Recall | Inter-Agent Kappa
1          | 0.950    | 0.925 | 0.959           | 0.772     | 0.629         | N/A
3          | 0.981    | 0.972 | 0.983           | 0.955     | 0.914         | 0.875
5          | 0.969    | 0.953 | 0.988           | 0.987     | 0.974         | 0.725
7          | 0.975    | 0.962 | 0.990           | 0.996     | 0.993         | 0.757
9          | 0.963    | 0.944 | 0.993           | 1.000     | 1.000         | 0.783
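
A sketch of the ablation loop; `evaluate_panel` is a hypothetical, caller-supplied harness (not study code) that scores one panel, and sampling is with replacement because panels of 7 or 9 must repeat some of the five profiles:

```python
import random

def panel_size_ablation(agents, manuscripts, evaluate_panel,
                        sizes=(1, 3, 5, 7, 9), trials=20):
    """Average each metric over `trials` random panels at each size.

    evaluate_panel(panel, manuscripts) -> dict of metrics for one panel.
    """
    results = {}
    for k in sizes:
        # Sample with replacement: the profile pool is smaller than the
        # largest panel, so duplicates are unavoidable at k > 5.
        runs = [evaluate_panel(random.choices(agents, k=k), manuscripts)
                for _ in range(trials)]
        results[k] = {m: sum(run[m] for run in runs) / trials for m in runs[0]}
    return results
```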

8 Key Findings & Recommendations

Finding 1: Multi-Agent Aggregation Is Essential

Individual agents vary substantially (accuracy 75.0%-95.6%; defect F1 0.587-0.813). Multi-agent aggregation with diverse profiles achieves 95.0% decision accuracy and 0.987 defect F1, demonstrating that ensemble review is the most viable deployment mode.

Finding 2: Union Defect Detection Is Highly Effective

Union aggregation achieves near-perfect recall (0.974) with perfect precision (1.000). LLM reviewer panels are most valuable as defect screening tools that surface potential issues for human assessment.

Finding 3: Adversarial Vulnerability Is a Critical Risk

Adversarial prompt injection causes +0.90 score inflation and 16.7% decision flip rate, confirming the signal-collapse concern. Input sanitization, adversarial training, and review provenance verification are needed before deployment.

Finding 4: Bias Profiles Create Predictable Trade-offs

Harsh reviewers have high defect sensitivity but poor accuracy; lenient reviewers (sycophancy) have the opposite profile. Careful panel composition with diverse bias profiles improves overall robustness.

Finding 5: Diminishing Returns Suggest Practical Panel Sizes

3-5 agents provide the best cost-efficacy trade-off for decision accuracy. Defect detection continues improving up to 9 agents. Resource-constrained deployments should use diverse 3-agent panels; high-stakes reviews warrant larger panels.

Deployment Recommendation

LLM reviewer agents should be deployed as ensemble screening assistants with human oversight. A panel of 3-5 diverse agents provides structured pre-reviews (defect alerts, dimension scores, confidence flags) that augment human judgment rather than replace it. Adversarial hardening and input sanitization are prerequisites for any editorial pipeline integration.