
When Does Widening the Scale Help?

A systematic study of score range adjustment for bias mitigation in LLM-as-a-Judge evaluations across tasks, alignment methods, and scoring configurations

Open Problem

Score range adjustment (widening the discrete scale offered to an LLM judge) has been proposed as a bias mitigation strategy, but whether it generalizes across tasks, alignment methods, and scoring configurations remains an open question (Sato et al., 2026). This work provides a systematic generalizability audit across 5 task types, 5 alignment profiles, and 7 scale granularities (175 conditions).

Conditions Where Range Helps: 84% (Spearman rho improves)
Largest Spearman Gain: +0.374 (Essay + Asymmetric DPO)
Experimental Conditions: 175 (5 tasks x 5 alignments x 7 K)
Task Variance Correlation: r = -0.477 (p = 0.016; lower variance = more benefit)

Finding 1: Range adjustment broadly helps ordinal accuracy

Widening the scale from K=5 to K=50 improves Spearman rank correlation in 84% of conditions (21 of 25). The improvement grows monotonically with K in nearly all conditions.
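A minimal simulation sketch of why a wider scale can preserve more ordinal information. Everything here is an assumption for illustration, not the study's setup: the judge is modeled as a hypothetical power-law compression (`power=0.3`) followed by binning into 1..K, and latent quality is drawn from Beta(5,2), the Essay Scoring profile. Spearman is computed as Pearson correlation on tie-averaged ranks.

```python
import random

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            out[order[pos]] = avg
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rho as Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def judge_score(quality, k, power=0.3):
    """Hypothetical compressed judge: push quality toward the top, then bin into 1..k."""
    compressed = quality ** power
    return min(k, int(compressed * k) + 1)

rng = random.Random(0)
quality = [rng.betavariate(5, 2) for _ in range(2000)]  # assumed latent quality
for k in (5, 10, 20, 50):
    scores = [judge_score(q, k) for q in quality]
    print(k, round(spearman(quality, scores), 3))
```

Under this toy compression, most items collapse into the top one or two levels at K=5, so many pairs are tied; a larger K splits the compressed region into more levels and recovers more of the true ordering.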

Finding 2: Kurtosis reduction is inconsistent

Earth mover's distance (EMD) increases systematically with K even as Spearman correlation improves. The distributional gap is rescaled rather than resolved.
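One possible reading of "rescaled rather than resolved" is sketched below. It assumes EMD is measured on the raw 1..K scale (the source does not say how EMD was computed), and the reference and judge scores are made-up values. For equal-sized 1-D samples, EMD reduces to the mean absolute difference between sorted values.

```python
def emd_1d(a, b):
    """Earth mover's distance between two equal-sized 1-D samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def to_scale(unit_values, k):
    """Map values in [0, 1] onto the continuous range [1, k] (no rounding)."""
    return [1 + v * (k - 1) for v in unit_values]

reference = [0.1, 0.3, 0.5, 0.7, 0.9]  # hypothetical reference scores in [0, 1]
judge = [0.5, 0.6, 0.7, 0.8, 0.9]      # hypothetical compressed judge scores

for k in (5, 10, 20, 50):
    raw = emd_1d(to_scale(reference, k), to_scale(judge, k))
    # Raw EMD grows with k; dividing by the scale width (k - 1) removes that growth.
    print(k, round(raw, 3), round(raw / (k - 1), 3))
```

In this toy case the shape mismatch between the two distributions is identical at every K, yet the raw EMD grows in proportion to the scale width. Comparing EMD across K therefore needs some normalization, such as the division by K - 1 used here.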

Finding 3: Range adjustment complements calibration

Wider ranges encode more information in raw scores. At narrow K, isotonic calibration can actually reduce rank correlation.
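One mechanism by which isotonic calibration can lose rank information is level merging: when the calibration targets are not monotone in the judge score, the fit pools adjacent levels into a single value, creating ties. The sketch below is a generic pool-adjacent-violators (PAVA) implementation, not the study's calibration code, and the per-level means are hypothetical.

```python
def pava(values, weights=None):
    """Pool-adjacent-violators: weighted least-squares non-decreasing fit."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block is [mean, total_weight, item_count]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

# Hypothetical mean reference quality observed at each judge level (K = 5).
level_means = [0.20, 0.50, 0.40, 0.60, 0.80]
calibrated = pava(level_means)
print(calibrated)
print(len(set(level_means)), "raw levels ->", len(set(calibrated)), "calibrated levels")
```

Here judge levels 2 and 3 are pooled, so every item scored 2 becomes tied with every item scored 3. With only five levels, losing one distinction removes a large share of the available ordering, whereas at wide K a single merge costs proportionally less.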

[Figure: Spearman rank correlation vs. scale granularity (K), shown per task across alignment profiles. Rank correlation improves monotonically in nearly all conditions.]

[Figure panels: Adaptive Two-Pass Protocol (Spearman); Predictive Factors (task variance vs. improvement); Task Quality Distributions; Compression Functions]

Adaptive Protocol Results

| Task | Alignment | K base | K adapted | Base Spearman | Adapted Spearman | Change |

Task Profiles

| Task | Distribution | Mean | Variance |
|---|---|---|---|
| Summarization | Beta(4,4) | 0.498 | 0.028 |
| Translation | Beta(2,2) | 0.500 | 0.050 |
| Open Generation | Uniform(0,1) | 0.499 | 0.084 |
| Code Review | Bimodal Beta | 0.560 | 0.101 |
| Essay Scoring | Beta(5,2) | 0.713 | 0.025 |
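As a quick sanity check on the task profiles above, the analytic moments of a Beta(a, b) distribution are mean = a / (a + b) and variance = ab / ((a + b)^2 (a + b + 1)). The sketch below evaluates them for the four profiles with stated parameters. Code Review is omitted because the mixture parameters of its bimodal Beta are not given, and Uniform(0,1) is treated as Beta(1,1).

```python
def beta_moments(a, b):
    """Analytic mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

profiles = {
    "Summarization": (4, 4),
    "Translation": (2, 2),
    "Open Generation": (1, 1),  # Uniform(0,1) is the same as Beta(1,1)
    "Essay Scoring": (5, 2),
}
for task, (a, b) in profiles.items():
    mean, var = beta_moments(a, b)
    print(f"{task}: mean={mean:.3f}, variance={var:.4f}")
```

The tabulated values agree with these analytic moments to within a few thousandths, which would be consistent with the table reporting sample estimates rather than exact values.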

Predictive Summary

| Metric | Value |
|---|---|
| Task variance correlation with improvement | r = -0.477 (p = 0.016) |
| Conditions where range adjustment helps | 21 / 25 (84%) |
| Conditions that never worsen all metrics | 25 / 25 (100%) |
| Largest Spearman gain (K=5 to K=50) | +0.374 (Essay + Asym. DPO) |
| Recommended default for strong compression | K >= 20 |
| Recommended default for mild compression | K = 5-10 |