T6-AT-009 [HIGH]

Evaluation Set Contamination

Risk score: 220

Rating: High

Procedures: 10


Mechanism

Evaluation set contamination exploits the gap between measured performance and actual capability. Modern LLM evaluation relies on a small number of widely used capability benchmarks (MMLU, GSM8K, HumanEval, GPQA Diamond, HellaSwag, etc.) and safety benchmarks (SORRY-Bench, AdvBench, HEx-PHI, StrongREJECT). When evaluation data leaks into training data, whether inadvertently through web crawling or deliberately through adversarial action, a model can reproduce memorized answers instead of demonstrating genuine capability, so its benchmark scores overstate what it can actually do.

Detection

Contamination detection methods: perplexity analysis on evaluation examples (unusually low perplexity suggests the example was seen in training) and n-gram overlap between training and evaluation corpora
Contamination-resistant benchmarks: GPQA Diamond, Humanity's Last Exam, and LiveCodeBench, which are designed to resist memorization
Statistical anomaly detection: flag models whose score patterns across benchmarks are statistically unusual
Canary-based contamination detection: embed unique canary strings in evaluation sets and monitor model outputs for their reproduction
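
Two of the detection methods above, n-gram overlap and canary checking, can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the word-level tokenization, and the default n-gram length of 8 are all assumptions made for the example.

```python
import re


def ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercase word n-grams in text (assumed tokenization)."""
    tokens = re.findall(r"\w+", text.lower())
    # range() is empty when the text has fewer than n tokens, giving an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(eval_example: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the eval example's n-grams that also occur in the training document."""
    eval_grams = ngrams(eval_example, n)
    if not eval_grams:
        return 0.0
    shared = eval_grams & ngrams(training_doc, n)
    return len(shared) / len(eval_grams)


def contains_canary(model_output: str, canary: str) -> bool:
    """True if the model output reproduces the canary string verbatim."""
    return canary in model_output
```

A high `overlap_fraction` for an evaluation example against any training document is a signal worth investigating; the threshold is a policy choice that the sketch does not set. The same score could drive the n-gram filtering used for training data decontamination. Verbatim matching in `contains_canary` will miss paraphrased or partially reproduced canaries.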

Mitigation

Contamination-resistant benchmark design (dynamic generation, post-cutoff sources) [HIGH]

Training data decontamination (n-gram filtering against evaluation sets) [MEDIUM]

Independent red-team evaluation with secret test sets [HIGH]

Evaluation harness code review and integrity verification [MEDIUM]
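
One way to approach the integrity verification item is to record cryptographic hashes of the evaluation harness and dataset files at review time, then re-check them before each evaluation run. The sketch below assumes a simple path-to-hash manifest; the function names and manifest layout are hypothetical and not taken from any particular harness.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Hash a file in chunks so large evaluation sets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(files: list) -> dict:
    """Map each file path (as a string) to its SHA-256 hex digest."""
    return {str(path): sha256_of_file(Path(path)) for path in files}


def changed_files(manifest: dict) -> list:
    """Return paths that are missing or whose current hash differs from the manifest."""
    changed = []
    for name, expected in manifest.items():
        path = Path(name)
        if not path.is_file() or sha256_of_file(path) != expected:
            changed.append(name)
    return changed
```

A non-empty result from `changed_files` means the harness or data differs from what was reviewed. The manifest itself has to be stored where an attacker who can modify the harness cannot also rewrite it, which this sketch does not address.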

Chaining

Evaluation set contamination is primarily used as a *covering* technique for other T6 attacks. T6-AT-002 (Dataset Contamination) or T6-AT-004 (Fine-Tuning Attacks) degrade the model; T6-AT-009 masks the degradation by ensuring evaluations still pass.
