Selective WER: Evaluating WER Under Selective Prediction in Long-Form ASR

Key Results

Metric	Value
AURCC (Good Cal.)	0.460
AURCC (Random)	0.583
aWER (Threshold)	0.004
aWER (Random)	0.160
Error Targeting (Good)	76.7%

Framework

sWER (Selective WER)

sWER = (S + D + I) / N

Treats abstentions as deletions. Cannot be gamed by abstaining. Always greater than or equal to standard WER.
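
As a minimal sketch of the formula above, assuming the alignment counts are already available: abstained reference words are added to the deletion count. The function and parameter names are illustrative, not from the source, and keeping `abstained` separate from ordinary deletions is an assumption made for clarity.

```python
def selective_wer(subs, dels, ins, abstained, n_ref):
    """sWER sketch: every abstained reference word is charged as a deletion.

    subs, dels, ins: error counts over the committed words
    abstained: number of reference words the system declined to transcribe
    n_ref: number of words in the reference (N)
    """
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    return (subs + dels + abstained + ins) / n_ref


# Hypothetical counts: 100 reference words, 10 errors before abstaining.
baseline = selective_wer(subs=5, dels=3, ins=2, abstained=0, n_ref=100)

# Abstain on 10 words, 4 of which were substitutions and 6 were correct.
with_abstention = selective_wer(subs=1, dels=3, ins=2, abstained=10, n_ref=100)

print(baseline, with_abstention)
```

In this hypothetical case sWER rises from 0.10 to 0.16: the four abstained substitutions stay in the numerator as deletions, and the six abstained correct words add new errors, which is why abstaining cannot lower the score.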

aWER (Abstention-Aware)

aWER = (S + I) / (N - A_total)

Error rate over committed words only. Must be reported alongside coverage. Rewards targeted abstention.
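
A small sketch of aWER reported together with coverage, under two assumptions that the text above does not spell out: `A_total` is taken to be the total number of abstained reference words, and coverage is taken to be the committed fraction `(N - A_total) / N`. Function names are illustrative.

```python
def abstention_aware_wer(subs, ins, abstained_total, n_ref):
    """aWER sketch: substitutions and insertions over committed words only."""
    committed = n_ref - abstained_total
    if committed <= 0:
        raise ValueError("no committed words; aWER is undefined")
    return (subs + ins) / committed


def coverage(abstained_total, n_ref):
    """Assumed definition: fraction of reference words the system committed to."""
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    return (n_ref - abstained_total) / n_ref


# Hypothetical counts: 100 reference words, 10 abstained,
# 1 substitution and 2 insertions among the 90 committed words.
awer = abstention_aware_wer(subs=1, ins=2, abstained_total=10, n_ref=100)
cov = coverage(abstained_total=10, n_ref=100)
print(f"aWER = {awer:.3f} at coverage = {cov:.1%}")
```

Printing the two values together reflects the reporting rule above: a low aWER is only meaningful once the reader knows how much of the transcript it covers.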

AURCC

AURCC = integral of Risk over Coverage

Scalar summary across all operating points. Lower is better. Separates informative uncertainty scores from random ones (see Experiment 2).
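
One way to approximate the integral from a finite set of operating points is the trapezoidal rule. The sketch below assumes that choice; the source does not say how the integral is evaluated or which risk measure is plotted against coverage, and the sample points are hypothetical.

```python
def aurcc(coverages, risks):
    """Approximate the area under a risk-coverage curve (trapezoidal rule).

    coverages and risks are parallel sequences, one entry per operating point.
    """
    if len(coverages) != len(risks):
        raise ValueError("coverages and risks must have the same length")
    # Sort by coverage so the segments are integrated left to right.
    points = sorted(zip(coverages, risks))
    area = 0.0
    for (c0, r0), (c1, r1) in zip(points, points[1:]):
        area += (c1 - c0) * (r0 + r1) / 2.0
    return area


# Hypothetical curve with three operating points.
area = aurcc([0.0, 0.5, 1.0], [0.0, 0.1, 0.3])
print(area)
```

Two abstention policies can then be compared by a single number, with the lower area indicating lower risk averaged over coverage levels.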

Experiment 1: Metrics Across Error Rates

Figure: WER Metrics vs Error Rate

Data Table

Error Rate	Std WER	sWER	aWER	Coverage
5%	0.067	0.206	0.000	81.6%
10%	0.158	0.236	0.020	81.2%
15%	0.176	0.242	0.000	81.5%
20%	0.224	0.236	0.014	82.5%
25%	0.339	0.352	0.131	81.8%
30%	0.303	0.309	0.079	81.8%

Experiment 2: Risk-Coverage Curves

Figure: Risk vs Coverage by Calibration

AURCC Summary

Calibration	AURCC	Std Dev
Good	0.460	0.041
Noisy	0.420	0.062
Random	0.583	0.070

Interpretation

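Lower AURCC indicates better selective prediction. Well-calibrated uncertainty achieves 21% lower AURCC than random scores (0.460 vs 0.583), so informative uncertainty scores are essential for effective selective prediction. AURCC does not separate good from noisy calibration here, however: the noisy condition scores slightly lower (0.420), and the 0.040 gap is within one standard deviation of either condition.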
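
The relative reduction can be checked directly from the AURCC Summary table. The helper name below is illustrative; the values are copied from the table.

```python
# AURCC values from the AURCC Summary table.
aurcc_by_calibration = {"Good": 0.460, "Noisy": 0.420, "Random": 0.583}


def relative_reduction(value, baseline):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


random_aurcc = aurcc_by_calibration["Random"]
good_vs_random = relative_reduction(aurcc_by_calibration["Good"], random_aurcc)
noisy_vs_random = relative_reduction(aurcc_by_calibration["Noisy"], random_aurcc)

print(f"Good vs Random:  {good_vs_random:.1%}")
print(f"Noisy vs Random: {noisy_vs_random:.1%}")
```

The Good condition comes out at roughly 21% below Random and the Noisy condition at roughly 28% below Random.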

Experiment 3: Transcript Length Scaling

Figure: Metrics vs Transcript Length

Length Scaling Table

Ref Length	Std WER	sWER	AURCC
33 words	0.172	0.222	0.447
62 words	0.210	0.226	0.445
121 words	0.190	0.253	0.466

Interpretation

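AURCC remains stable (0.445 to 0.466) across the tested reference lengths (33 to 121 words), so the framework shows no degradation over this range. The table covers lengths only up to 121 words, so behavior on longer transcripts is not established here.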

Experiment 4: Abstention Strategy Comparison

Figure: sWER and aWER by Strategy

Strategy Comparison at ~79% Coverage

Strategy	sWER	aWER	Coverage
Oracle	0.252	0.004	78.9%
Uncertainty Threshold	0.252	0.004	78.9%
Random	0.367	0.160	78.9%

Key Finding

Uncertainty-based threshold matches oracle performance (sWER = 0.252, aWER = 0.004), while random abstention yields 45.6% higher sWER (0.367) and 40x higher aWER (0.160).
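
The comparison can be illustrated with a toy simulation. This is a sketch on synthetic data, not the experiment's code: errors are substitutions only, the uncertainty scores are drawn so that errors tend to score higher, and all names and parameters (15% error rate, 80% coverage, Gaussian scores) are assumptions chosen for illustration.

```python
import random


def abstain_top_k(scores, k):
    """Indices of the k highest-scoring words (the ones to abstain on)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def metrics(is_error, abstained):
    """Return (sWER, aWER, coverage) for a substitution-only toy transcript."""
    n = len(is_error)
    committed = [i for i in range(n) if i not in abstained]
    subs = sum(is_error[i] for i in committed)
    swer = (subs + len(abstained)) / n   # abstentions charged as deletions
    awer = subs / len(committed)         # errors over committed words only
    return swer, awer, len(committed) / n


rng = random.Random(0)
n = 1000
is_error = [rng.random() < 0.15 for _ in range(n)]
# Synthetic uncertainty: errors are centred at 1.0, correct words at 0.0.
uncertainty = [rng.gauss(1.0 if e else 0.0, 0.3) for e in is_error]
k = int(0.2 * n)  # abstain on 20% of words, i.e. 80% coverage

strategies = {
    "Oracle": abstain_top_k([float(e) for e in is_error], k),
    "Uncertainty Threshold": abstain_top_k(uncertainty, k),
    "Random": set(rng.sample(range(n), k)),
}
results = {name: metrics(is_error, a) for name, a in strategies.items()}

for name, (swer, awer, cov) in results.items():
    print(f"{name:22s} sWER={swer:.3f}  aWER={awer:.3f}  coverage={cov:.1%}")
```

Because every strategy abstains on the same number of words, differences in sWER and aWER come only from which words are dropped, which is the effect the table isolates.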

Experiment 5: Oracle Decomposition

Figure: Error-Targeting Precision

Decomposition Table

Error Rate	Calibration	Error Frac	Correct Frac
10%	Good	0.120	0.780
10%	Noisy	0.080	0.860
10%	Random	0.040	0.960
20%	Good	0.511	0.444
20%	Noisy	0.311	0.644
20%	Random	0.156	0.822
30%	Good	0.767	0.139
30%	Noisy	0.553	0.403
30%	Random	0.261	0.692

Comprehensive Summary

Main Results (15% Error Rate, 80% Coverage, 8 Trials)

Calibration	Std WER	sWER	aWER	AURCC
Good	0.186 ± 0.064	0.216 ± 0.019	0.009 ± 0.024	0.448 ± 0.049
Noisy	0.186 ± 0.064	0.273 ± 0.032	0.072 ± 0.026	0.421 ± 0.048
Random	0.186 ± 0.064	0.333 ± 0.065	0.156 ± 0.077	0.566 ± 0.042