Evaluation Set Contamination
T6 · Training & Feedback Poisoning
Evaluation set contamination exploits the gap between measured performance and actual capability. Modern LLM evaluation relies on a small number of widely used capability benchmarks (MMLU, GSM8K, HumanEval, GPQA Diamond, HellaSwag, and others) and safety benchmarks (SORRY-Bench, AdvBench, HEx-PHI, StrongREJECT). When evaluation data leaks into training data, whether inadvertently through web crawling or deliberately through adversarial action, a model can memorize the answers instead of demonstrating genuine capability, so its benchmark scores overstate what it can actually do.
- Contamination detection methods: perplexity analysis on evaluation examples (unusually low perplexity suggests memorization), n-gram overlap between training and evaluation corpora
- Contamination-resistant benchmarks: GPQA Diamond, Humanity's Last Exam, LiveCodeBench — designed to resist memorization
- Statistical anomaly detection: flag models with "statistically unusual score patterns" across benchmarks
- Canary-based contamination detection: embed unique canary strings in evaluation sets and monitor for reproduction
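Two of the checks listed above, n-gram overlap and canary detection, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, whitespace tokenization, the default n-gram length of 8, and the canary value in the usage note are all hypothetical choices made for the example.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in text.

    Assumption: lowercase whitespace tokenization. A real pipeline would
    normalize punctuation and use the model's own tokenizer.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(eval_example: str,
                     training_docs: Iterable[str],
                     n: int = 8) -> float:
    """Fraction of the evaluation example's n-grams found in any training doc.

    Values near 1.0 suggest the example appears close to verbatim in the
    training corpus. The n-gram length and any flagging threshold are
    judgment calls (hypothetical defaults here).
    """
    eval_grams = ngrams(eval_example, n)
    if not eval_grams:
        return 0.0  # example shorter than n tokens: nothing to compare
    seen: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        seen |= eval_grams & ngrams(doc, n)
    return len(seen) / len(eval_grams)


def contains_canary(model_output: str, canary: str) -> bool:
    """True if the model reproduced the canary string verbatim.

    The canary should be a unique string that appears only inside the
    evaluation set, so reproduction implies the set was seen in training.
    """
    return canary in model_output
```

In use, `overlap_fraction` would be run for each benchmark item against the training corpus (or a sample of it), and items above a chosen threshold reviewed by hand. `contains_canary` would be applied to model completions sampled with a prompt that begins the canary, for instance a placeholder such as `"EVAL-CANARY-hypothetical-0000"`. Exact substring matching misses paraphrased or reformatted leaks, so both checks are best treated as lower bounds on contamination.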
Evaluation set contamination (T6-AT-009) is primarily used as a *covering* technique for other T6 attacks: T6-AT-002 (Dataset Contamination) or T6-AT-004 (Fine-Tuning Attacks) degrades the model, and T6-AT-009 masks the degradation by ensuring that evaluations still pass.