Selective WER: Evaluating WER Under Selective Prediction in Long-Form ASR

A principled framework for computing Word Error Rate when a subset of predicted words is intentionally ignored based on word-level uncertainty, enabling fair evaluation under selective prediction settings.

cs.CL - Computation and Language arXiv:2601.18415

Key Results

AURCC (Good Cal.): 0.460
AURCC (Random): 0.583
aWER (Threshold): 0.004
aWER (Random): 0.160
Error Targeting (Good): 76.7%

Framework

sWER (Selective WER)

sWER = (S + D + I) / N

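Treats abstentions as deletions, so it cannot be gamed by abstaining and is always greater than or equal to standard WER. S, D, and I are the substitution, deletion, and insertion counts; N is the number of reference words.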
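A minimal sketch of one way sWER could be computed, assuming abstained hypothesis words are dropped before a standard word-level Levenshtein alignment, so that the reference words they covered surface as deletions. The names `edit_distance`, `wer`, and `selective_wer`, and the boolean `abstain` mask, are illustrative and not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance, i.e. the minimum S + D + I."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Standard WER = (S + D + I) / N, with N the reference length."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def selective_wer(ref, hyp, abstain):
    """sWER under the stated assumption: abstained words are removed
    from the hypothesis, then ordinary WER is computed."""
    if len(hyp) != len(abstain):
        raise ValueError("abstain mask must have one entry per hypothesis word")
    committed = [word for word, skip in zip(hyp, abstain) if not skip]
    return wer(ref, committed)


ref = "the cat sat on the mat".split()
hyp = "the cat sat in the mat".split()
no_abstain = [False] * len(hyp)
abstain_cat = [word == "cat" for word in hyp]
print(selective_wer(ref, hyp, no_abstain))   # 1 error over 6 reference words
print(selective_wer(ref, hyp, abstain_cat))  # 2 errors over 6 reference words
```

In this example, abstaining on the correctly recognized word "cat" adds an error instead of hiding one. Note that this simple version does not by itself guarantee a score at or above standard WER: if the abstained word was a spurious insertion, removing it lowers the edit distance, and how that case is handled is not specified in this summary.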

aWER (Abstention-Aware)

aWER = (S + I) / (N - A_total)

Error rate over committed words only. Because abstaining on more words can lower it, it must be reported alongside coverage. Rewards targeted abstention.
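The aWER formula can be sketched directly from alignment counts. This example assumes that A_total is the total number of abstained words and that coverage is (N - A_total) / N; neither definition is spelled out in this summary, so both are assumptions, and `SelectiveScore` and `abstention_aware_wer` are hypothetical names.

```python
from typing import NamedTuple


class SelectiveScore(NamedTuple):
    awer: float      # (S + I) / (N - A_total)
    coverage: float  # assumed here to be (N - A_total) / N


def abstention_aware_wer(subs, ins, n_ref, n_abstained):
    """Return aWER together with coverage so the two are never separated."""
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    committed = n_ref - n_abstained
    if committed <= 0:
        raise ValueError("no committed words: aWER is undefined")
    return SelectiveScore(awer=(subs + ins) / committed,
                          coverage=committed / n_ref)


# 100 reference words, 20 abstained, 2 substitutions and 1 insertion remain.
score = abstention_aware_wer(subs=2, ins=1, n_ref=100, n_abstained=20)
print(f"aWER = {score.awer:.4f} at coverage = {score.coverage:.1%}")
```

Returning the pair rather than a bare float is a design choice that mirrors the requirement to report aWER alongside coverage.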

AURCC (Area Under the Risk-Coverage Curve)

AURCC = integral of Risk over Coverage

Scalar summary across all operating points. Lower is better. Discriminates calibration quality.
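As a rough illustration of the integral, the sketch below applies the trapezoidal rule to a list of (coverage, risk) operating points. The choice of the trapezoidal rule, the lack of normalization, and the function name `aurcc` are assumptions; this summary does not state which risk measure (for example sWER or aWER) or which integration scheme the paper uses.

```python
def aurcc(coverages, risks):
    """Trapezoidal area under a risk-coverage curve.

    `coverages` and `risks` are parallel sequences, one entry per
    operating point. Points are sorted by coverage before integrating.
    """
    if len(coverages) != len(risks):
        raise ValueError("coverages and risks must have the same length")
    if len(coverages) < 2:
        raise ValueError("need at least two operating points")
    points = sorted(zip(coverages, risks))
    area = 0.0
    for (c0, r0), (c1, r1) in zip(points, points[1:]):
        area += (c1 - c0) * (r0 + r1) / 2.0  # one trapezoid per segment
    return area


# Toy curve: risk grows as more words are committed.
print(aurcc([0.0, 0.5, 1.0], [0.0, 0.1, 0.3]))
```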

Experiment 1: Metrics Across Error Rates

[Figure: WER Metrics vs Error Rate]

Data Table

Error Rate  Std WER  sWER   aWER   Coverage
5%          0.067    0.206  0.000  81.6%
10%         0.158    0.236  0.020  81.2%
15%         0.176    0.242  0.000  81.5%
20%         0.224    0.236  0.014  82.5%
25%         0.339    0.352  0.131  81.8%
30%         0.303    0.309  0.079  81.8%

Experiment 2: Risk-Coverage Curves

[Figure: Risk vs Coverage by Calibration]

AURCC Summary

Calibration  AURCC  Std Dev
Good         0.460  0.041
Noisy        0.420  0.062
Random       0.583  0.070

Interpretation

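Lower AURCC indicates better selective prediction. Well-calibrated uncertainty achieves 21% lower AURCC than random scores (0.460 vs. 0.583), indicating that informative uncertainty scores are essential for effective selective prediction. The noisy setting scores lower still (0.420), but its gap to the good setting (0.040) is within one standard deviation of either, so this experiment does not separate the two.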
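The 21% figure follows from the AURCC Summary table as a relative reduction against the random baseline. A small check, using only the table values (the helper name `relative_reduction` is illustrative):

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


# AURCC values copied from the AURCC Summary table.
aurcc_by_calibration = {"Good": 0.460, "Noisy": 0.420, "Random": 0.583}
baseline = aurcc_by_calibration["Random"]

for name in ("Good", "Noisy"):
    reduction = relative_reduction(baseline, aurcc_by_calibration[name])
    print(f"{name}: {reduction:.1%} lower AURCC than Random")
```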

Experiment 3: Transcript Length Scaling

[Figure: Metrics vs Transcript Length]

Length Scaling Table

Ref Length  Std WER  sWER   AURCC
33 words    0.172    0.222  0.447
62 words    0.210    0.226  0.445
121 words   0.190    0.253  0.466

Interpretation

AURCC remains stable (0.445 to 0.466) across reference lengths of 33 to 121 words, suggesting that the framework does not degrade as transcripts grow within this range.

Experiment 4: Abstention Strategy Comparison

[Figure: sWER and aWER by Strategy]

Strategy Comparison at ~79% Coverage

Strategy               sWER   aWER   Coverage
Oracle                 0.252  0.004  78.9%
Uncertainty Threshold  0.252  0.004  78.9%
Random                 0.367  0.160  78.9%

Key Finding

The uncertainty-based threshold strategy matches oracle performance (sWER = 0.252, aWER = 0.004), while random abstention at the same coverage yields 45.6% higher sWER (0.367) and 40x higher aWER (0.160).
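To make the comparison concrete, here is a hedged sketch of how threshold-based and random abstention masks could be built at a matched coverage. It assumes the threshold is chosen by rank, so that exactly the most uncertain words are dropped until the target coverage is reached; the paper's actual procedure is not described in this summary, and `threshold_abstain` and `random_abstain` are hypothetical names.

```python
import random


def threshold_abstain(uncertainty, coverage):
    """Abstain on the most uncertain words, keeping a `coverage` fraction.

    Returns a boolean mask where True means "abstain on this word".
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be in [0, 1]")
    n_keep = round(coverage * len(uncertainty))
    # Indices sorted from most confident to least confident.
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    keep = set(order[:n_keep])
    return [i not in keep for i in range(len(uncertainty))]


def random_abstain(n_words, coverage, seed=0):
    """Abstain on a uniformly random subset with the same coverage."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be in [0, 1]")
    rng = random.Random(seed)
    n_abstain = n_words - round(coverage * n_words)
    dropped = set(rng.sample(range(n_words), n_abstain))
    return [i in dropped for i in range(n_words)]


# Five words; the two with the highest uncertainty are dropped at 60% coverage.
scores = [0.1, 0.9, 0.2, 0.8, 0.05]
print(threshold_abstain(scores, coverage=0.6))
print(random_abstain(len(scores), coverage=0.6))
```

Both functions abstain on the same number of words, which keeps coverage equal across strategies so that differences in sWER and aWER reflect only which words were dropped.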

Experiment 5: Oracle Decomposition

[Figure: Error-Targeting Precision]

Decomposition Table

Error Rate  Calibration  Error Frac  Correct Frac
10%         Good         0.120       0.780
10%         Noisy        0.080       0.860
10%         Random       0.040       0.960
20%         Good         0.511       0.444
20%         Noisy        0.311       0.644
20%         Random       0.156       0.822
30%         Good         0.767       0.139
30%         Noisy        0.553       0.403
30%         Random       0.261       0.692

Comprehensive Summary

Main Results (15% Error Rate, 80% Coverage, 8 Trials)

Calibration  Std WER        sWER           aWER           AURCC
Good         0.186 ± 0.064  0.216 ± 0.019  0.009 ± 0.024  0.448 ± 0.049
Noisy        0.186 ± 0.064  0.273 ± 0.032  0.072 ± 0.026  0.421 ± 0.048
Random       0.186 ± 0.064  0.333 ± 0.065  0.156 ± 0.077  0.566 ± 0.042