Comparison of strategies for preventing inflated LLM evaluation metrics caused by training-data overlap (contamination)
| Strategy | F1 (approx.) | Effectiveness | Relative Cost |
|---|---|---|---|
| No Mitigation | 0.008 | 0.0% | 1.00x |
| N-gram Deduplication | 0.745 | 58.6% | 1.15x |
| Embedding Deduplication | 0.918 | 83.5% | 1.45x |
| Dynamic Regeneration | 0.977 | 93.5% | 2.10x |
| Score Adjustment | 0.900 | 80.5% | 1.05x |
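As a rough illustration of the n-gram deduplication row, the sketch below flags evaluation examples whose token n-gram overlap with the training corpus exceeds a threshold and drops them before metrics are computed. All function names, the n-gram size, and the 0.5 threshold are illustrative assumptions, not a specific tool's implementation.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text (hypothetical tokenizer: whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(example: str, train_ngrams: set, n: int = 8, threshold: float = 0.5) -> bool:
    """Flag an eval example if the fraction of its n-grams also seen in training meets the threshold."""
    example_ngrams = ngrams(example, n)
    if not example_ngrams:
        return False  # too short to measure overlap; keep by default
    overlap = len(example_ngrams & train_ngrams) / len(example_ngrams)
    return overlap >= threshold

def dedup_eval_set(eval_set: list, train_docs: list, n: int = 8, threshold: float = 0.5) -> list:
    """Keep only eval examples that do not overlap heavily with the training corpus."""
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    return [e for e in eval_set if not is_contaminated(e, train_ngrams, n, threshold)]
```

The exact-match set intersection here is what keeps this method cheap (the ~1.15x cost row) but also why it misses paraphrased leakage, which embedding-based deduplication catches at higher cost.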