Marginalizing Over BPE Tokenizations for Calibrated Word-Level Probabilities in Whisper

BPE tokenizers map a single word to multiple valid token sequences, causing the true word-level probability to be distributed across exponentially many paths through the decoder. We construct a segmentation DAG whose paths enumerate all valid tokenizations, and propose a forward algorithm that marginalizes over this DAG. Across 184 English words, we find that canonical estimates capture as little as 34% of the true word probability, and that beam-pruned marginalization with width 10 recovers over 96% of the exact marginal.

The Problem: Hidden Probability Mass

Whisper and similar ASR systems use byte-pair encoding (BPE) to segment text into subword tokens. When estimating word-level confidence, practitioners multiply the conditional probabilities of the tokens in the canonical BPE segmentation. But BPE tokenizers are ambiguous: the same word admits multiple valid token sequences. This means the true word probability requires summing over all valid tokenizations.

P(word | audio) = Σs ∈ S(word) P(s | audio)

Using only the canonical segmentation yields a lower bound on the true word probability. The difference is the marginalization gap:

Δ(w) = log P(w | audio) − log P(s* | audio) ≥ 0
184
English words analyzed
50,257
BPE vocabulary entries
0.26
Mean gap (nats, single case)
1.09
Mean gap (nats, with case variants)
Key insight: A gap of 0.26 nats means the canonical estimate captures only ~77% of the true word probability on average. With case variants, canonical estimates capture as little as ~34% of the true probability.

Methodology: Segmentation DAG & Forward Algorithm

We model all valid tokenizations of a word as paths through a directed acyclic graph (DAG), then run a forward algorithm to compute the exact marginalized probability.

1
Build Segmentation DAG: Given word string of length n, create nodes {0, 1, ..., n} for character positions. Add edge (i, j, token_id) whenever substring w[i:j] matches a vocabulary token. Source-to-sink paths enumerate all valid tokenizations.
2
Enumerate Case Variants: Construct separate DAGs for lowercase, title-case, and uppercase forms. The total probability sums over all variant DAGs.
3
Forward Algorithm: Traverse DAG in topological order. At each node, maintain partial path histories with accumulated log-probabilities. Propagate using log-sum-exp to aggregate probability at the sink node.
4
Beam Pruning: For words with thousands of paths, retain only the B most probable partial paths at each node. This produces a lower bound that improves monotonically with beam width.

Interactive: Tokenization Paths for "cat"

The word cat (with space prefix:  cat) has 8 valid tokenization paths through Whisper's GPT-2 vocabulary. The canonical single-token path captures 99.5% of probability mass, but alternative decompositions collectively contribute measurable additional probability.

Observation: All 3-character words have exactly 8 tokenization paths: one canonical, three 2-token splits, three 3-token splits, and one full character-level decomposition. The canonical path dominates with 99.5% of probability mass.

Tokenization Paths Grow Exponentially

The number of valid BPE tokenizations grows exponentially with word length, following approximately 100.35n growth. Each additional character roughly doubles the number of valid tokenizations by introducing new split points.

Length N Words Median Paths Max Paths Median Edges Gap (single) Gap (+case)
2-3204460.0051.104
4-5811516140.0291.118
6-75893110260.5351.229
8-101493493340.2261.010
11+243,0065,337460.8591.293

The Marginalization Gap

The marginalization gap, the log-probability missed by using only the canonical tokenization, grows with both word length and the number of tokenization paths. For complex words (11+ characters), the gap reaches 0.86 nats (single case) and 1.29 nats (with case variants).

1
The single-case marginalization gap ranges from 0.005 nats for short words to 1.58 nats for complex words. In probability space, a 0.26 nat mean gap means the canonical estimate captures only exp(-0.26) = 77% of the true word probability.
2
Including casing variants raises the mean gap to 1.09 nats, meaning canonical estimates capture only exp(-1.09) = 34% of the total probability. The case-inclusive gap is relatively constant across word lengths (~1.10-1.29 nats), suggesting casing ambiguity is a length-independent source of probability dispersion.

Beam Search Convergence

Full marginalization can be expensive for long words with thousands of paths. Beam-pruned marginalization retains only the top-B partial paths at each DAG node. The approximation converges rapidly because probability is concentrated in a few dominant paths.

Word Paths B=5 B=10 B=20 B=50
cat899.99%100.0%100.0%100.0%
playing8099.6%99.9%100.0%100.0%
application1,01194.3%98.8%99.8%100.0%
international3,64287.6%96.0%99.2%99.9%
3
Beam width B=10 recovers >99.9% of exact marginal probability for words up to 7 characters and >96% for words up to 13 characters. The rapid convergence occurs because the probability distribution over paths is highly concentrated in the top few paths.

Case Variant Contributions

In BPE vocabularies, cat, Cat, and CAT are distinct tokens with separate token IDs. Ignoring case variants loses a significant portion of the total word probability. The chart below shows how probability mass is distributed across lowercase, title-case, and uppercase variants.

4
Case variants are separate token sequences in BPE vocabularies. In the mock decoder setting, the lowercase variant accounts for a median of ~38% of total probability, with title-case and uppercase capturing ~62%. Casing adds an average of 1.09 nats to the marginalization gap.

Summary of Key Results

1
Exponential path growth: Valid tokenization paths grow as ~100.35n with word length n. Median paths range from 4 (2-3 char words) to 3,006 (11+ char words), reaching a maximum of 5,337 for "understanding."
2
Systematic underestimation: The marginalization gap averages 0.26 nats (single case) and 1.09 nats (with case variants). Canonical estimates capture as little as 34% of the true word probability, introducing a length-dependent bias in confidence scores.
3
Practical approximation: Beam-pruned marginalization with B=10 recovers >96% of exact probability even for 13-character words, requiring at most 460 decoder queries. This makes calibrated word-level uncertainty practical for real-time ASR.
4
Compact representation: While the number of paths is exponential, the DAG representation grows only quadratically (O(n2) edges), enabling efficient algorithms. The DAG structure is independent of the decoder and applies to any autoregressive BPE-based model.