A systematic study of score range adjustment for bias mitigation in LLM-as-a-Judge evaluations across tasks, alignment methods, and scoring configurations
Score range adjustment (widening the discrete scale offered to an LLM judge) has been proposed as a bias mitigation strategy, but whether it generalizes across tasks, alignment methods, and scoring configurations remains an open question (Sato et al., 2026). This work provides a systematic generalizability audit across 5 task types, 5 alignment profiles, and 7 scale granularities (175 conditions).
Widening the scale from K=5 to K=50 improves Spearman rank correlation in 84% of conditions (21 of 25 task-alignment pairs). In nearly all conditions the improvement grows monotonically with K.
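To make the effect of widening concrete, here is a minimal sketch in plain Python. It assumes a deliberately simple, noise-free judge that squeezes true quality toward the midpoint of the scale and then rounds to one of K levels. The compression factor of 0.5, the helper names, and the evenly spaced quality values are illustrative assumptions, not the simulation used in the study.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rho computed as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def judge_score(quality, k, compression=0.5):
    """Hypothetical judge: squeeze quality toward 0.5, then round to 1..k."""
    squeezed = 0.5 + compression * (quality - 0.5)
    return round(squeezed * (k - 1)) + 1

# 200 evenly spaced "true" quality values in [0, 1] (illustrative only).
quality = [i / 199 for i in range(200)]
rho = {
    k: spearman(quality, [judge_score(q, k) for q in quality])
    for k in (5, 10, 20, 50)
}
for k, value in rho.items():
    print(f"K={k:>2}: Spearman rho = {value:.4f}")
```

In this toy setting the compressed judge uses only a few of the five available levels, so many items tie. A wider scale gives the same compressed signal more distinct levels to land on, which is one plausible reading of why rank correlation rises with K.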
Earth Mover's Distance (EMD) increases systematically with K even as Spearman correlation improves. The distributional gap is rescaled rather than resolved.
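The "rescaled rather than resolved" reading can be illustrated with a small sketch. It assumes that EMD is measured in raw score units on the 1..K scale, which the text above does not specify, and it reuses a hypothetical noise-free judge that compresses quality by a factor of 0.5. For two equal-size one-dimensional samples, EMD reduces to the mean absolute difference between the sorted samples.

```python
def quantize(x, k):
    """Map a value in [0, 1] to an integer level in 1..k."""
    return round(x * (k - 1)) + 1

def emd_1d(a, b):
    """1-D Earth Mover's Distance between two equal-size samples."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Illustrative "true" quality values and a compressed judge (assumptions).
quality = [i / 199 for i in range(200)]
raw_emd = {}
normalized_emd = {}
for k in (5, 50):
    reference = [quantize(q, k) for q in quality]
    judge = [quantize(0.5 + 0.5 * (q - 0.5), k) for q in quality]
    raw_emd[k] = emd_1d(reference, judge)
    # Dividing by the scale width expresses the gap as a fraction of the range.
    normalized_emd[k] = raw_emd[k] / (k - 1)
    print(f"K={k:>2}: raw EMD = {raw_emd[k]:.3f}, "
          f"normalized EMD = {normalized_emd[k]:.3f}")
```

Under these assumptions the raw EMD grows roughly in proportion to the scale width, while the normalized gap barely moves: the judge is no better calibrated at K=50, the same mismatch is simply expressed in larger units.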
Wider ranges encode more information in raw scores. At narrow K, isotonic calibration can actually reduce rank correlation.
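One mechanism by which isotonic calibration could lower rank correlation at narrow K is pooling: when the calibration targets are not monotone in the judge level, adjacent levels are merged and items the judge had separated become tied. The sketch below is a plain pool-adjacent-violators implementation. The per-level target values are hypothetical, and the source does not state that this is the mechanism at work.

```python
def isotonic_fit(targets, weights=None):
    """Pool Adjacent Violators: nondecreasing least-squares fit to targets."""
    if weights is None:
        weights = [1.0] * len(targets)
    blocks = []  # each block is [mean, total_weight, item_count]
    for y, w in zip(targets, weights):
        blocks.append([y, w, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

# Hypothetical mean reference quality observed at each of K=5 judge levels.
# Level 3 sits slightly below level 2, e.g. because of sampling noise.
per_level_target = [0.25, 0.5, 0.375, 0.625, 0.875]
calibrated = isotonic_fit(per_level_target)
print(calibrated)  # [0.25, 0.4375, 0.4375, 0.625, 0.875]
print("distinct levels before:", len(set(per_level_target)))
print("distinct levels after: ", len(set(calibrated)))
```

With only five levels, losing one to pooling removes a large share of the available ordering information. On a wide scale the same pooling step affects a much smaller fraction of the levels.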
Interactive chart: Spearman rho versus K across alignment profiles, selectable by task. Rank correlation improves monotonically in nearly all conditions.
Interactive table columns: Task, Alignment, K base, K adapted, Base Spearman, Adapted Spearman, Change.
| Task | Distribution | Mean | Variance |
|---|---|---|---|
| Summarization | Beta(4,4) | 0.498 | 0.028 |
| Translation | Beta(2,2) | 0.500 | 0.050 |
| Open Generation | Uniform(0,1) | 0.499 | 0.084 |
| Code Review | Bimodal Beta | 0.560 | 0.101 |
| Essay Scoring | Beta(5,2) | 0.713 | 0.025 |
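The means and variances in the table above can be compared with the closed-form moments of the named distributions. The sketch below does this for the three single Beta rows and for Uniform(0,1), which equals Beta(1,1). The small differences are consistent with the table reporting sample estimates, although that is an assumption. The Bimodal Beta row is skipped because its mixture parameters are not given.

```python
def beta_moments(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, variance

# (task, (a, b), reported mean, reported variance) taken from the table.
rows = [
    ("Summarization", (4, 4), 0.498, 0.028),
    ("Translation", (2, 2), 0.500, 0.050),
    ("Open Generation", (1, 1), 0.499, 0.084),  # Uniform(0,1) == Beta(1,1)
    ("Essay Scoring", (5, 2), 0.713, 0.025),
]

gaps = {}
for task, (a, b), reported_mean, reported_var in rows:
    mean, var = beta_moments(a, b)
    gaps[task] = (abs(mean - reported_mean), abs(var - reported_var))
    print(f"{task:<16} analytic mean={mean:.3f} var={var:.3f} | "
          f"reported mean={reported_mean:.3f} var={reported_var:.3f}")
```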
| Metric | Value |
|---|---|
| Correlation between task variance and improvement | r = -0.477 (p = 0.016) |
| Conditions where range adjustment helps | 21 / 25 (84%) |
| Conditions that never worsen all metrics | 25 / 25 (100%) |
| Largest Spearman gain (K=5 to K=50) | +0.374 (Essay + Asym. DPO) |
| Recommended default for strong compression | K >= 20 |
| Recommended default for mild compression | K = 5-10 |
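As a consistency check on the reported correlation, the sketch below converts r = -0.477 into a two-sided p-value using the standard t-test for a Pearson correlation. It assumes the correlation was computed over n = 25 task-alignment pairs, which the table does not state. The Student t tail is obtained with Simpson's rule so that only the standard library is needed.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * (total * h / 3)  # probability of |T| <= |t|
    return 1 - central_mass

r = -0.477
n = 25  # assumed sample size: 5 tasks x 5 alignment profiles
df = n - 2
t_stat = r * math.sqrt(df / (1 - r * r))
p_value = two_sided_p(t_stat, df)
print(f"t = {t_stat:.3f}, df = {df}, two-sided p = {p_value:.3f}")
```

Under the n = 25 assumption the result lands close to the reported p = 0.016, which suggests the correlation was taken across task-alignment pairs rather than across all 175 conditions.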