T6-AT-009 · HIGH

Evaluation Set Contamination

T6 · Training & Feedback Poisoning
Risk score: 220
Rating: High
Procedures: 10
Mechanism

Evaluation set contamination exploits the gap between measured performance and actual capability. Modern LLM evaluation relies on a small number of widely used capability benchmarks (MMLU, GSM8K, HumanEval, GPQA Diamond, HellaSwag, etc.) and safety benchmarks (SORRY-Bench, AdvBench, HEx-PHI, StrongREJECT). When evaluation data leaks into training data, whether inadvertently through web crawling or deliberately through adversarial action, a model can score well by memorizing answers rather than by demonstrating genuine capability.

Detection
  • Contamination detection methods: perplexity analysis on evaluation examples, n-gram overlap between training and evaluation corpora
  • Contamination-resistant benchmarks: GPQA Diamond, Humanity's Last Exam, LiveCodeBench — designed to resist memorization
  • Statistical anomaly detection: flag models with "statistically unusual score patterns" across benchmarks
  • Canary-based contamination detection: embed unique canary strings in evaluation sets and monitor for reproduction
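Two of the checks listed above, n-gram overlap and canary-string scanning, can be sketched roughly as follows. This is a minimal illustration and not a production detector: it assumes whitespace tokenization, an in-memory corpus, and an 8-token window. The function names, the sample texts, and the canary value are all hypothetical.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 8) -> Set[NGram]:
    """Return the set of lowercased whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(eval_example: str, training_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the evaluation example's n-grams that also occur in training data."""
    eval_grams = ngrams(eval_example, n)
    if not eval_grams:
        return 0.0  # example is shorter than n tokens, nothing to compare
    train_grams: Set[NGram] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(eval_grams & train_grams) / len(eval_grams)


def contains_canary(text: str, canary: str) -> bool:
    """True if the canary string appears verbatim in `text`."""
    return canary in text


# Hypothetical data, for illustration only.
CANARY = "EVAL-CANARY-00000000-0000-0000-0000-000000000000"
eval_item = "What is the capital of France and which river flows through it today"
clean_doc = "A recipe for sourdough bread that needs flour water salt and patience"
leaked_doc = "forum dump: " + eval_item + " " + CANARY

clean_score = overlap_ratio(eval_item, [clean_doc])
leaked_score = overlap_ratio(eval_item, [clean_doc, leaked_doc])
print(clean_score, leaked_score, contains_canary(leaked_doc, CANARY))
```

The same `contains_canary` check can be applied to model outputs as well as to training documents. Exact matching of this kind will not catch paraphrased or translated copies of an evaluation item, so it is best read as a lower bound on contamination.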
Mitigation
  • Contamination-resistant benchmark design (dynamic generation, post-cutoff sources): HIGH
  • Training data decontamination (n-gram filtering against evaluation sets): MEDIUM
  • Independent red-team evaluation with secret test sets: HIGH
  • Evaluation harness code review and integrity verification: MEDIUM
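As a rough sketch of two of the mitigations above, the snippet below filters training documents that share any n-gram with an evaluation item, and compares an evaluation harness file against a pinned SHA-256 digest. It assumes whitespace tokenization, an 8-token window, and that a reviewed digest is stored somewhere an attacker cannot modify. All names and sample data are hypothetical.

```python
import hashlib
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercased whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(training_docs: Iterable[str], eval_items: Iterable[str], n: int = 8) -> List[str]:
    """Keep only training documents that share no n-gram with any evaluation item."""
    blocked: Set[NGram] = set()
    for item in eval_items:
        blocked |= ngrams(item, n)
    return [doc for doc in training_docs if not (ngrams(doc, n) & blocked)]


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()


def harness_unchanged(current: bytes, expected_digest: str) -> bool:
    """True if the harness bytes still match the digest recorded at review time."""
    return sha256_hex(current) == expected_digest


# Hypothetical data, for illustration only.
eval_items = ["If a train travels sixty miles in one hour how far does it travel in three hours"]
docs = [
    "Notes on gardening in early spring when the soil is still cold and wet outside",
    "scraped page: " + eval_items[0] + " answer 180",
]
kept = decontaminate(docs, eval_items)

reviewed = b"def score(pred, gold):\n    return pred == gold\n"
pinned_digest = sha256_hex(reviewed)  # recorded after code review
tampered = reviewed.replace(b"pred == gold", b"True")
print(len(kept), harness_unchanged(reviewed, pinned_digest), harness_unchanged(tampered, pinned_digest))
```

Dropping a whole document on a single shared n-gram is a deliberately strict choice for the sketch. A real pipeline might instead remove only the overlapping span or apply an overlap threshold.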
Chaining

Evaluation set contamination is primarily used as a *covering* technique for other T6 attacks. T6-AT-002 (Dataset Contamination) or T6-AT-004 (Fine-Tuning Attacks) degrade the model; T6-AT-009 masks the degradation by ensuring evaluations still pass.
