Dictation-style speech recognition produces transcripts containing out-of-context insertions -- procedural commands (e.g., "new paragraph," "semicolon") and ambient speech fragments (e.g., "can you hear me in the back") that are transcribed verbatim alongside intended text.
Bondarenko et al. (2026) reported that a single-head Longformer model segmented paragraphs successfully but removed too few of these inserts, achieving only 0.362 recall on REMOVE tokens and thus missing 63.8% of insertions.
This paper addresses the open problem: How can we reliably detect and remove out-of-context insertions while preserving correct paragraph segmentation?
Given a token sequence produced by ASR, predict a label for each token: KEEP (retain in output), REMOVE (discard), or PARA_BREAK (insert paragraph boundary). KEEP tokens dominate at 79-93% depending on insert density, creating severe class imbalance.
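To make the labeling concrete, here is a toy example; the tokens and the convention that PARA_BREAK lands on the first token of the new paragraph are inventions of this sketch, not the paper's specification.

```python
# Toy example: "new paragraph" is a procedural command and
# "can you hear me" is ambient speech; both are labeled REMOVE.
# PARA_BREAK here marks the token that opens the new paragraph
# (this attachment convention is an assumption of the sketch).
tokens = ["The", "meeting", "ended", "new", "paragraph",
          "can", "you", "hear", "me", "Minutes", "follow"]
labels = ["KEEP", "KEEP", "KEEP", "REMOVE", "REMOVE",
          "REMOVE", "REMOVE", "REMOVE", "REMOVE", "PARA_BREAK", "KEEP"]
assert len(tokens) == len(labels)
```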
We decompose the task into two sub-tasks, insert removal and paragraph segmentation, each handled by a dedicated head of a dual-head architecture and unified through a CRF fusion layer that enforces structural constraints.
Coherence Gating: Computes a per-token gate value from the token hidden state and the [CLS] representation; tokens dissimilar to the global document representation receive amplified REMOVE logits.
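A minimal sketch of one plausible gate formulation, assuming a learned projection over the concatenated token and [CLS] states and a multiplicative boost on the REMOVE logit (the paper does not pin down these details):

```python
import torch
import torch.nn as nn

class CoherenceGate(nn.Module):
    """Per-token gate computed from the token state and the [CLS] state;
    a high gate value amplifies that token's REMOVE logit."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(2 * hidden_size, 1)

    def forward(self, token_states, cls_state, remove_logits):
        # token_states: (B, T, H); cls_state: (B, H); remove_logits: (B, T)
        cls_expanded = cls_state.unsqueeze(1).expand_as(token_states)
        gate = torch.sigmoid(
            self.proj(torch.cat([token_states, cls_expanded], dim=-1))
        ).squeeze(-1)                      # (B, T), values in (0, 1)
        # Training pushes the gate up for tokens incoherent with the
        # document-level [CLS] representation, boosting their REMOVE score.
        return remove_logits * (1.0 + gate)
```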
Focal Loss: Addresses class imbalance with gamma=2.0 and class weights alpha=[0.3, 0.7] to up-weight the minority REMOVE class.
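A standard focal-loss implementation matching the stated hyperparameters; the function name and the binary KEEP/REMOVE head it serves are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=(0.3, 0.7)):
    """Focal loss for the binary insert head (0 = KEEP, 1 = REMOVE).

    ce = -log p_t, so p_t = exp(-ce); (1 - p_t)**gamma down-weights
    easy examples, and alpha up-weights the minority REMOVE class.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)
    alpha_t = logits.new_tensor(alpha)[targets]
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()

# Toy usage: per-token logits of shape (N, 2), labels of shape (N,)
logits = torch.randn(8, 2, requires_grad=True)
targets = torch.randint(0, 2, (8,))
loss = focal_loss(logits, targets)
```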
CRF Fusion: A linear-chain CRF models transition constraints: penalizes isolated single-token REMOVE predictions and prevents adjacent REMOVE and PARA_BREAK labels.
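A sketch of the constrained CRF using the pytorch-crf package (an assumed implementation choice); the hard -1e4 penalties and the soft span-opening penalty are illustrative values, not the paper's:

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf

KEEP, REMOVE, PARA = 0, 1, 2          # assumed label indices

crf = CRF(num_tags=3, batch_first=True)
with torch.no_grad():
    # transitions[i, j] scores moving from label i to label j.
    # Forbid adjacent REMOVE/PARA_BREAK in either order.
    crf.transitions[REMOVE, PARA] = -1e4
    crf.transitions[PARA, REMOVE] = -1e4
    # A first-order CRF cannot forbid the KEEP-REMOVE-KEEP pattern
    # outright; raising the cost of opening a REMOVE span discourages
    # isolated single-token REMOVE predictions.
    crf.transitions[KEEP, REMOVE] -= 2.0

emissions = torch.randn(2, 10, 3)      # fused per-token logits (B, T, num_tags)
tags = torch.zeros(2, 10, dtype=torch.long)
nll = -crf(emissions, tags)            # negative log-likelihood training term
best_paths = crf.decode(emissions)     # Viterbi decoding at inference
```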
Loss Weighting: Total loss is L_CRF + 0.3*L_para + 0.7*L_insert, explicitly prioritizing the harder insert-removal sub-task.
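In code form, with stand-in scalar losses (variable names are illustrative):

```python
import torch

# Stand-ins for the three terms; in the full model these come from the
# CRF fusion layer, the paragraph head, and the focal-loss insert head.
loss_crf = torch.tensor(1.2, requires_grad=True)
loss_para = torch.tensor(0.4, requires_grad=True)
loss_insert = torch.tensor(0.9, requires_grad=True)

# The harder insert-removal sub-task carries the larger weight.
loss_total = loss_crf + 0.3 * loss_para + 0.7 * loss_insert
loss_total.backward()
```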
Results on the combined test set (40 samples across all density levels, 12% mean insert density). Metrics prefixed R- are computed on the REMOVE class.
| Method | R-Precision | R-Recall | R-F1 | Para F1 | Word F1 | Notes |
|---|---|---|---|---|---|---|
| Rule-Based | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | Lexicon Dependent |
| Coherence-Based | 0.102 | 1.000 | 0.184 | 0.000 | 0.000 | |
| Single-Head Longformer | 0.950 | 0.362 | 0.514 | 0.925 | 0.964 | Baseline |
| Dual-Head (Ours) | 0.932 | 0.867 | 0.891 | 0.908 | 0.988 | Best |
Performance across insert density levels from 3% to 30%: the single-head baseline plateaus at 0.51-0.60 R-F1 regardless of density, confirming an architectural limitation, while our dual-head model maintains 0.812-0.936 R-F1 across all densities.
Each row removes one component from the full model. The base dual-head model (without CRF, gate, or focal loss) achieves 0.805 F1, already substantially outperforming the single-head baseline (0.514), demonstrating that the architectural decomposition itself provides the largest single benefit.
| Configuration | R-Precision | R-Recall | R-F1 | Delta F1 |
|---|---|---|---|---|
| Full Model | 0.932 | 0.867 | 0.891 | - |
| w/o CRF | 0.809 | 0.942 | 0.866 | -0.025 |
| w/o Focal Loss | 0.883 | 0.734 | 0.787 | -0.104 |
| w/o CRF + Gate | 0.809 | 0.942 | 0.866 | -0.025 |
| w/o All (Base Dual-Head) | 0.787 | 0.835 | 0.805 | -0.086 |
| Single-Head Baseline | 0.950 | 0.362 | 0.514 | -0.377 |
Synthetic dictation data are generated by injecting known inserts into clean literary texts; a minimal generation sketch follows the table below. Four density levels, each with 30 training, 5 validation, and 5 test samples.
| Density Level | Insert Rate | Total Tokens | KEEP % | REMOVE % | PARA_BREAK % | Train / Val / Test |
|---|---|---|---|---|---|---|
| Low | 5% | 4,020 | 92.79 | 5.22 | 1.99 | 30 / 5 / 5 |
| Medium | 10% | 4,217 | 88.45 | 9.65 | 1.90 | 30 / 5 / 5 |
| High | 15% | 4,361 | 85.53 | 12.63 | 1.83 | 30 / 5 / 5 |
| Very High | 25% | 4,716 | 79.09 | 19.21 | 1.70 | 30 / 5 / 5 |
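As a concrete illustration of the injection recipe, here is an assumption-laden sketch: the insert phrases, the per-gap injection probability, and the omission of PARA_BREAK handling are all illustrative rather than the paper's actual generator.

```python
import random

# Hypothetical insert inventory; the paper's actual lexicon is not given here.
COMMANDS = ["new paragraph", "semicolon", "comma"]
AMBIENT = ["can you hear me in the back", "one second please"]

def inject_inserts(clean_tokens, insert_rate=0.10, seed=0):
    """Inject an insert phrase before each token with probability insert_rate.

    Returns the noisy token stream plus gold labels. PARA_BREAK labels
    (derived from paragraph boundaries in the source text) are omitted
    in this sketch.
    """
    rng = random.Random(seed)
    out_tokens, labels = [], []
    for tok in clean_tokens:
        if rng.random() < insert_rate:
            phrase = rng.choice(COMMANDS + AMBIENT).split()
            out_tokens += phrase
            labels += ["REMOVE"] * len(phrase)
        out_tokens.append(tok)
        labels.append("KEEP")
    return out_tokens, labels

tokens, labels = inject_inserts("the quick brown fox jumps over".split(), 0.25)
```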
[Figure: color-coded qualitative examples, with tokens highlighted as KEEP, REMOVE, or [PARA_BREAK].]
Example 1: 112 tokens total (96 KEEP, 14 REMOVE, 2 PARA_BREAK).
Example 2: 122 tokens total (96 KEEP, 24 REMOVE, 2 PARA_BREAK).
Confusion matrices (each cell is the percentage of the true class receiving that prediction; columns sum to 100%).

Single-Head Longformer:

| Pred \ True | KEEP | REMOVE | PARA |
|---|---|---|---|
| KEEP | 100.0% | 63.0% | 11.2% |
| REMOVE | 0.0% | 37.0% | 0.0% |
| PARA | 0.0% | 0.0% | 88.8% |

Dual-Head (Ours):

| Pred \ True | KEEP | REMOVE | PARA |
|---|---|---|---|
| KEEP | 99.3% | 12.9% | 11.2% |
| REMOVE | 0.7% | 87.1% | 0.0% |
| PARA | 0.0% | 0.0% | 88.8% |