T13-AT-002 · HIGH

Dataset Contamination

T13 · AI Supply Chain & Artifact Trust
Risk score: 245
Rating: High
Procedures: 10
Mechanism

Training datasets are consumed by hundreds or thousands of independent training runs. Poisoning a dataset at its source (HuggingFace Datasets, Common Crawl, The Pile, RedPajama, etc.) is therefore a force multiplier: a single poisoning action affects every model trained on the poisoned version of that dataset. Unlike model poisoning (T13-AT-001), dataset poisoning can be stealthier because individual training examples are harder to audit than executable model files.

Detection
  • Dataset integrity verification: cryptographic hashing of dataset versions at download time
  • Data diff analysis: compare downloaded datasets against known-good baselines
  • Provenance tracking: verify data sources for composite datasets
  • Content anomaly detection: statistical profiling of dataset content distribution across versions
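The data diff and content anomaly checks above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes each dataset version fits in memory as a list of JSON-serializable dicts, and that records carry a `label` field. The function names (`record_fingerprint`, `diff_versions`, `label_shift`) are hypothetical and not taken from any particular library.

```python
import hashlib
import json
from collections import Counter


def record_fingerprint(record: dict) -> str:
    """Hash one record in a canonical form so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_versions(baseline: list, candidate: list) -> dict:
    """Count records added to or removed from a known-good baseline."""
    base = {record_fingerprint(r) for r in baseline}
    cand = {record_fingerprint(r) for r in candidate}
    return {"added": len(cand - base), "removed": len(base - cand)}


def label_shift(baseline: list, candidate: list, key: str = "label") -> float:
    """Total variation distance between the label distributions (0.0 to 1.0).

    Assumption: every record has the field named by `key`.
    """
    if not baseline or not candidate:
        raise ValueError("both versions must be non-empty")
    p = Counter(r[key] for r in baseline)
    q = Counter(r[key] for r in candidate)
    labels = set(p) | set(q)
    return 0.5 * sum(
        abs(p[l] / len(baseline) - q[l] / len(candidate)) for l in labels
    )
```

A modified record shows up as one removal plus one addition, so a relabeling attack that leaves the text untouched is still visible in the diff. What counts as a suspicious number of changes or a suspicious distribution shift depends on the dataset; any alert threshold would need to be calibrated against its normal version-to-version churn.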
Mitigation
  • Dataset pinning with cryptographic hashes (HIGH)
  • Multi-source data validation (MEDIUM)
  • Automated data quality monitoring across versions (MEDIUM)
  • Private dataset curation with provenance chain (HIGH)
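As one possible shape for dataset pinning, the sketch below checks downloaded files against a manifest of expected SHA-256 digests before training starts. The manifest format (a dict mapping relative file paths to hex digests) and the names `sha256_file` and `verify_pins` are assumptions made for illustration. In practice the pins would live in version control or a signed lockfile, separate from the download source.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pins(root, pins: dict) -> list:
    """Return the relative paths that are missing or whose hash differs.

    `pins` maps relative file path -> expected SHA-256 hex digest
    (a hypothetical manifest format). An empty list means all pins match.
    """
    failures = []
    for rel_path, expected in pins.items():
        target = Path(root) / rel_path
        if not target.is_file() or sha256_file(target) != expected:
            failures.append(rel_path)
    return failures
```

A training pipeline could call `verify_pins` right after download and refuse to proceed if the returned list is non-empty. Pinning only detects changes made after the pin was recorded; it says nothing about whether the pinned version was clean to begin with, which is where the provenance and multi-source validation items come in.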
Chaining

Dataset supply chain contamination directly enables T6-AT-002 (Dataset Contamination at training time), T6-AT-007 (Preference Learning Corruption via poisoned preference datasets), and T6-AT-009 (Evaluation Set Contamination via poisoned benchmarks). Synthetic dataset distribution (T13-AP-002F) chains to T6-AT-005 (Synthetic Data Poisoning).
