Effective Mitigation of Benchmark Data Contamination

Comparing strategies to prevent inflated LLM performance metrics from training data overlap

cs.CL | Gan et al. 2026 | arXiv: 2601.02907

Best Effectiveness: 93.5%
Best F1 (Approx.): 0.977
Dynamic Regen Cost: 2.1x
Embedding Dedup Effectiveness: 83.5%

Strategy Comparison

Strategy          F1 (Approx.)   Effectiveness   Relative Cost
No Mitigation     0.008          0.0%            1.00x
N-gram Dedup      0.745          58.6%           1.15x
Embedding Dedup   0.918          83.5%           1.45x
Dynamic Regen     0.977          93.5%           2.10x
Score Adjust      0.900          80.5%           1.05x
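
One way to read the table above is to ask how much effectiveness each strategy buys per unit of cost. The short Python sketch below does that arithmetic on the table's own values. The "effectiveness per cost" ratio and the function name `effectiveness_per_cost` are illustrative assumptions for this comparison, not a metric reported by the source.

```python
# Values transcribed from the Strategy Comparison table.
# Tuple layout: (effectiveness in percent, cost relative to no mitigation).
STRATEGIES = {
    "No Mitigation": (0.0, 1.00),
    "N-gram Dedup": (58.6, 1.15),
    "Embedding Dedup": (83.5, 1.45),
    "Dynamic Regen": (93.5, 2.10),
    "Score Adjust": (80.5, 1.05),
}


def effectiveness_per_cost(table):
    """Return (strategy, ratio) pairs sorted from highest ratio to lowest.

    The ratio is effectiveness (percentage points) divided by relative cost.
    This is a hypothetical convenience metric, not one defined by the source.
    """
    ratios = {name: eff / cost for name, (eff, cost) in table.items()}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


ranking = effectiveness_per_cost(STRATEGIES)
for name, ratio in ranking:
    print(f"{name:<16} {ratio:6.1f}")
```

Under this assumed ratio, Score Adjust ranks first and Dynamic Regen ranks below both deduplication strategies, even though Dynamic Regen has the highest raw effectiveness. The ratio ignores detection F1, so it is a rough reading aid rather than a recommendation.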

[Figure: Detection F1 by Contamination Type]

[Figure: Performance Inflation vs Contamination Rate]

[Figure: Mitigation Effectiveness by Type (10% Rate)]

[Figure: Inflation Scaling with Model Size]