A principled framework for computing Word Error Rate when a subset of predicted words is intentionally ignored based on word-level uncertainty, enabling fair evaluation under selective prediction settings.
cs.CL - Computation and Language, arXiv:2601.18415

sWER: Treats abstentions as deletions. Cannot be gamed by abstaining. Always greater than or equal to standard WER.

aWER: Error rate over committed words only. Must be reported alongside coverage. Rewards targeted abstention.

AURCC: Scalar summary across all operating points. Lower is better. Discriminates calibration quality.
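A minimal sketch of how standard WER, sWER, aWER, and coverage relate, under a strong simplifying assumption that is not taken from the source: hypothesis and reference are already aligned word for word, so every error is a substitution and no edit-distance alignment is needed. The names `selective_scores`, `correct`, and `abstain` are hypothetical, and the paper's own alignment and averaging may differ, so this sketch is not expected to reproduce the numbers in the tables that follow.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SelectiveScores:
    std_wer: float   # errors over all words, ignoring abstention
    swer: float      # abstained words are charged as deletions
    awer: float      # errors over committed words only
    coverage: float  # fraction of words the system commits to


def selective_scores(correct: Sequence[bool], abstain: Sequence[bool]) -> SelectiveScores:
    """Score one position-aligned transcript (hypothetical helper).

    correct[i] is True when hypothesis word i matches the reference.
    abstain[i] is True when the system declines to commit word i.
    Assumption: one hypothesis word per reference word (substitutions only).
    """
    if len(correct) != len(abstain) or not correct:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct)
    committed = [c for c, a in zip(correct, abstain) if not a]
    committed_errors = sum(1 for c in committed if not c)
    abstained = n - len(committed)

    std_wer = sum(1 for c in correct if not c) / n
    # Every abstained word costs one deletion, whether it was right or wrong.
    swer = (committed_errors + abstained) / n
    # With nothing committed there are no committed errors; report 0.0.
    awer = committed_errors / len(committed) if committed else 0.0
    coverage = len(committed) / n
    return SelectiveScores(std_wer, swer, awer, coverage)


# Toy transcript: 10 words, words 2 and 4 are wrong.
correct = [True, True, False, True, False, True, True, True, True, True]
# Targeted abstention: decline exactly the two wrong words.
targeted = [False, False, True, False, True, False, False, False, False, False]
print(selective_scores(correct, targeted))
```

In this toy case, abstaining on exactly the two wrong words leaves sWER equal to standard WER (0.2) and drives aWER to 0.0 at 80% coverage, while abstaining on every word pushes sWER to 1.0. That is the sense in which sWER cannot be gamed and aWER needs coverage reported next to it.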
| Error Rate | Std WER | sWER | aWER | Coverage |
|---|---|---|---|---|
| 5% | 0.067 | 0.206 | 0.000 | 81.6% |
| 10% | 0.158 | 0.236 | 0.020 | 81.2% |
| 15% | 0.176 | 0.242 | 0.000 | 81.5% |
| 20% | 0.224 | 0.236 | 0.014 | 82.5% |
| 25% | 0.339 | 0.352 | 0.131 | 81.8% |
| 30% | 0.303 | 0.309 | 0.079 | 81.8% |

| Calibration | AURCC | Std Dev |
|---|---|---|
| Good | 0.460 | 0.041 |
| Noisy | 0.420 | 0.062 |
| Random | 0.583 | 0.070 |
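Lower AURCC indicates better selective prediction. Well-calibrated uncertainty achieves 21% lower AURCC than random scores (0.460 vs. 0.583), which supports the claim that informative uncertainty is needed for effective selective prediction. The noisy scores reach 0.420, about 0.04 below the well-calibrated scores and within the reported standard deviations, so this table separates informative from random uncertainty but does not separate Good from Noisy.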
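The idea of a scalar summary over all operating points can be sketched as follows, assuming AURCC denotes an area under a risk-coverage curve. This sketch takes risk to be the error rate among committed words and integrates with the trapezoidal rule; the source may integrate a different risk (for example sWER) or use a different grid, so the values here are not comparable to the table. The function name `risk_coverage_area` and the toy uncertainty values are hypothetical.

```python
from typing import List, Sequence, Tuple


def risk_coverage_area(correct: Sequence[bool], uncertainty: Sequence[float]) -> float:
    """Trapezoidal area under a risk-coverage curve (hypothetical helper).

    Words are committed from least to most uncertain. After each commit,
    risk is the error rate among the words committed so far.
    Assumption: correct[i] marks a position-aligned word as right or wrong.
    """
    if len(correct) != len(uncertainty) or not correct:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct)
    order = sorted(range(n), key=lambda i: uncertainty[i])

    errors = 0
    points: List[Tuple[float, float]] = []  # (coverage, risk) per operating point
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((k / n, errors / k))

    area = 0.0
    for (c0, r0), (c1, r1) in zip(points, points[1:]):
        area += (c1 - c0) * (r0 + r1) / 2
    return area


# Toy transcript: 10 words, words 2 and 4 are wrong.
correct = [True, True, False, True, False, True, True, True, True, True]
# Informative scores put the highest uncertainty on the wrong words.
informative = [0.1, 0.2, 0.9, 0.1, 0.8, 0.3, 0.2, 0.1, 0.4, 0.2]
# Misleading scores invert that ranking.
misleading = [1.0 - u for u in informative]

print(risk_coverage_area(correct, informative))
print(risk_coverage_area(correct, misleading))
```

With the informative scores the wrong words are committed last, so risk stays at zero for most of the curve and the area is small. Inverting the scores commits the wrong words first and yields a much larger area, which is the behavior "lower is better" refers to.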
| Ref Length | Std WER | sWER | AURCC |
|---|---|---|---|
| 33 words | 0.172 | 0.222 | 0.447 |
| 62 words | 0.210 | 0.226 | 0.445 |
| 121 words | 0.190 | 0.253 | 0.466 |
AURCC remains stable (0.445 to 0.466) as reference length grows from 33 to 121 words, suggesting that the framework handles longer transcripts without degradation over this range.
| Strategy | sWER | aWER | Coverage |
|---|---|---|---|
| Oracle | 0.252 | 0.004 | 78.9% |
| Uncertainty Threshold | 0.252 | 0.004 | 78.9% |
| Random | 0.367 | 0.160 | 78.9% |
At matched coverage (78.9%), the uncertainty-based threshold matches oracle performance (sWER = 0.252, aWER = 0.004), while random abstention yields 45.6% higher sWER (0.367) and 40x higher aWER (0.160).
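The three strategies can be compared at a fixed abstention budget with a short sketch. It assumes the same position-aligned simplification (one hypothesis word per reference word, substitutions only), which is not taken from the source. The helpers `abstain_top_k` and `swer_awer` and the uncertainty values are hypothetical and chosen only so that the highest uncertainty falls on the wrong words.

```python
import random
from typing import List, Sequence, Tuple


def abstain_top_k(scores: Sequence[float], k: int) -> List[bool]:
    """Return a mask that abstains on the k highest-scoring words."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(scores))]


def swer_awer(correct: Sequence[bool], abstain: Sequence[bool]) -> Tuple[float, float]:
    """sWER and aWER under the position-aligned simplification."""
    n = len(correct)
    committed = [c for c, a in zip(correct, abstain) if not a]
    errors = sum(1 for c in committed if not c)
    swer = (errors + n - len(committed)) / n  # abstentions charged as deletions
    awer = errors / len(committed) if committed else 0.0
    return swer, awer


# Toy transcript: 10 words, words 2 and 4 are wrong.
correct = [True, True, False, True, False, True, True, True, True, True]
uncertainty = [0.1, 0.2, 0.9, 0.1, 0.8, 0.3, 0.2, 0.1, 0.4, 0.2]
k = 2  # same abstention budget for every strategy, so coverage is matched

# Oracle: abstain on words known to be wrong.
oracle = abstain_top_k([0.0 if c else 1.0 for c in correct], k)
# Uncertainty threshold: abstain on the k most uncertain words.
threshold = abstain_top_k(uncertainty, k)
# Random: abstain on k words chosen by random scores (seeded for repeatability).
rng = random.Random(0)
random_mask = abstain_top_k([rng.random() for _ in correct], k)

for name, mask in [("oracle", oracle), ("threshold", threshold), ("random", random_mask)]:
    print(name, swer_awer(correct, mask))
```

Because the toy uncertainty ranks both wrong words highest, the threshold strategy selects the same words as the oracle and both reach aWER of 0.0. Random abstention spends the same budget but can do no better than the oracle on either metric, since its sWER is bounded below by the abstained fraction.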
| Error Rate | Calibration | Error Frac | Correct Frac |
|---|---|---|---|
| 10% | Good | 0.120 | 0.780 |
| 10% | Noisy | 0.080 | 0.860 |
| 10% | Random | 0.040 | 0.960 |
| 20% | Good | 0.511 | 0.444 |
| 20% | Noisy | 0.311 | 0.644 |
| 20% | Random | 0.156 | 0.822 |
| 30% | Good | 0.767 | 0.139 |
| 30% | Noisy | 0.553 | 0.403 |
| 30% | Random | 0.261 | 0.692 |

| Calibration | Std WER | sWER | aWER | AURCC |
|---|---|---|---|---|
| Good | 0.186 ± 0.064 | 0.216 ± 0.019 | 0.009 ± 0.024 | 0.448 ± 0.049 |
| Noisy | 0.186 ± 0.064 | 0.273 ± 0.032 | 0.072 ± 0.026 | 0.421 ± 0.048 |
| Random | 0.186 ± 0.064 | 0.333 ± 0.065 | 0.156 ± 0.077 | 0.566 ± 0.042 |