Dual-Head Longformer for Out-of-Context Insert Removal

Problem Statement

Dictation-style speech recognition produces transcripts containing out-of-context insertions -- procedural commands (e.g., "new paragraph," "semicolon") and ambient speech fragments (e.g., "can you hear me in the back") that are transcribed verbatim alongside intended text.

Bondarenko et al. (2026) reported that a single-head Longformer model successfully segmented paragraphs but failed to remove a sufficient number of such inserts. The single-head approach achieves only 0.362 recall on REMOVE tokens, meaning it misses 63.8% of insertions.

This paper addresses the open problem: How can we reliably detect and remove out-of-context insertions while preserving correct paragraph segmentation?

Task Formulation

Given a token sequence produced by ASR, predict labels for each token: KEEP (retain in output), REMOVE (discard), or PARA_BREAK (insert paragraph boundary). KEEP tokens dominate at 79-93% depending on insert density, creating severe class imbalance.

Methods

Dual-Head Architecture

We decompose the task into two specialized sub-tasks via a dual-head architecture, each head specializing in one sub-task, unified through a CRF fusion layer that enforces structural constraints.

Shared Longformer Encoder (Sliding Window + Global Attention)

↓ ↓

Paragraph Head
CE Loss, lambda=0.3

Insert Head + Coherence Gate
Focal Loss (gamma=2.0), lambda=0.7

↓ ↓

CRF Fusion Layer -- 3-class structured decoding

Key Components

Coherence Gating: Computes per-token gate value using the token hidden state and [CLS] representation. Tokens dissimilar to the global document representation receive amplified REMOVE logits.

Focal Loss: Addresses class imbalance with gamma=2.0 and class weights alpha=[0.3, 0.7] to up-weight the minority REMOVE class.

CRF Fusion: A linear-chain CRF models transition constraints: penalizes isolated single-token REMOVE predictions and prevents adjacent REMOVE and PARA_BREAK labels.

Loss Weighting: Total loss is L_CRF + 0.3*L_para + 0.7*L_insert, explicitly prioritizing the harder insert-removal sub-task.

Main Results

Results on the combined test set (40 samples across all density levels, 12% mean insert density).

Method	R-Precision	R-Recall	R-F1	Para F1	Word F1
Rule-Based	1.000	1.000	1.000	1.000	1.000	Lexicon Dependent
Coherence-Based	0.102	1.000	0.184	0.000	0.000
Single-Head Longformer	0.950	0.362	0.514	0.925	0.964	Baseline
Dual-Head (Ours)	0.932	0.867	0.891	0.908	0.988	Best

Method Comparison (All Metrics)

REMOVE-Class: Precision vs Recall

Insert Density Analysis

Performance across insert density levels from 3% to 30%. The single-head baseline plateaus around 0.51-0.60 regardless of density, confirming an architectural limitation. Our dual-head model maintains 0.812-0.936 across all densities.

REMOVE F1 by Density

Word-Level Cleaning F1 by Density

Ablation Study

Each row removes one component from the full model. The base dual-head model (without CRF, gate, or focal loss) achieves 0.805 F1, already substantially outperforming the single-head baseline (0.514), demonstrating the architectural decomposition itself provides the largest benefit.

Configuration	R-Precision	R-Recall	R-F1	Delta F1
Full Model	0.932	0.867	0.891
w/o CRF	0.809	0.942	0.866	-0.025
w/o Focal Loss	0.883	0.734	0.787	-0.104
w/o CRF + Gate	0.809	0.942	0.866	-0.025
w/o All (Base Dual-Head)	0.787	0.835	0.805	-0.086
Single-Head Baseline	0.950	0.362	0.514	-0.377

Ablation: Component Contributions

Precision-Recall Trade-off by Config

Dataset Statistics

Synthetic dictation data generated by injecting known inserts into clean literary texts. Four density levels; each with 30 training, 5 validation, and 5 test samples.

Density Level	Insert Rate	Total Tokens	KEEP %	REMOVE %	PARA_BREAK %	Train / Val / Test
Low	5%	4,020	92.79	5.22	1.99	30 / 5 / 5
Medium	10%	4,217	88.45	9.65	1.90	30 / 5 / 5
High	15%	4,361	85.53	12.63	1.83	30 / 5 / 5
Very High	25%	4,716	79.09	19.21	1.70	30 / 5 / 5

Label Distribution by Density

Token Counts by Density

Annotated Examples

Sample from High-Density Test Set

Colors: KEEPREMOVE[PARA_BREAK]

Statistics: 112 tokens total: 96 KEEP, 14 REMOVE, 2 PARA_BREAK

Sample from Very High-Density Test Set

Statistics: 122 tokens total: 96 KEEP, 24 REMOVE, 2 PARA_BREAK

Key Findings

73.5% relative improvement in REMOVE-class F1 (0.891 vs 0.514) by decomposing the task into two specialized sub-heads, demonstrating that the single-head architecture conflated paragraph segmentation and insert removal under one objective.
Architectural decomposition is the largest contributor: The base dual-head model without CRF, gate, or focal loss already achieves 0.805 REMOVE F1, a 56.6% relative improvement over the single-head baseline.
Focal loss provides the largest component gain: Removing focal loss drops F1 from 0.891 to 0.787 (-0.104), confirming class imbalance is a critical factor where KEEP tokens comprise 79-93% of data.
CRF contributes precision: Removing the CRF reduces precision from 0.932 to 0.809 while recall increases to 0.942, indicating the CRF filters isolated false-positive REMOVE predictions.
Consistent across densities: Our model achieves 0.812-0.936 REMOVE F1 from 3% to 30% insert density, while the single-head baseline plateaus at 0.49-0.60, confirming its limited recall is an architectural limitation.
No trade-off with other sub-tasks: Paragraph F1 remains strong at 0.908 and word-level cleaning F1 is 0.988, showing insert removal improvement does not degrade other capabilities.
Miss rate reduced from 63.0% to 12.9%: The single-head baseline misclassifies 63.0% of REMOVE tokens as KEEP; our model reduces this to 12.9%.

Confusion Analysis Summary

Single-Head Longformer
Pred \ True	KEEP	REMOVE	PARA
KEEP	100.0%	63.0%	11.2%
REMOVE	0.0%	37.0%	0.0%
PARA	0.0%	0.0%	88.8%

Dual-Head (Ours)
Pred \ True	KEEP	REMOVE	PARA
KEEP	99.3%	12.9%	11.2%
REMOVE	0.7%	87.1%	0.0%
PARA	0.0%	0.0%	88.8%

Dual-Head Longformer with Coherence Gating for Removal of Out-of-Context Inserts in Dictation Transcripts