Dataset Contamination
T13 · AI Supply Chain & Artifact Trust

Training datasets are consumed by hundreds or thousands of independent training runs. Poisoning a dataset at its source (Hugging Face Datasets, Common Crawl, The Pile, RedPajama, etc.) is a force multiplier: a single poisoning action can affect every model subsequently trained on the poisoned version. Unlike model poisoning (T13-AT-001), dataset poisoning can be stealthier because individual training examples are harder to audit than executable model files.
- Dataset integrity verification: cryptographic hashing of dataset versions at download time
- Data diff analysis: compare downloaded datasets against known-good baselines
- Provenance tracking: verify data sources for composite datasets
- Content anomaly detection: statistical profiling of dataset content distribution across versions
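
The first two controls above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes the dataset has already been downloaded to a local directory, and that a known-good baseline manifest was recorded earlier from a trusted copy. The function names (`file_sha256`, `build_manifest`, `diff_manifests`) are hypothetical and chosen only for this example.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(
    baseline: dict[str, str], current: dict[str, str]
) -> dict[str, list[str]]:
    """Report files added, removed, or modified relative to the baseline."""
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(
            name
            for name in baseline.keys() & current.keys()
            if baseline[name] != current[name]
        ),
    }
```

In use, a pipeline would call `build_manifest` on the freshly downloaded copy and pass the result to `diff_manifests` together with the stored baseline. Any non-empty `added`, `removed`, or `modified` list is a signal to pause training and inspect the change. File-level hashing only shows that something changed, not whether the change is malicious, so it complements provenance tracking and content profiling rather than replacing them.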
Dataset supply chain contamination directly enables T6-AT-002 (Dataset Contamination at training time), T6-AT-007 (Preference Learning Corruption via poisoned preference datasets), and T6-AT-009 (Evaluation Set Contamination via poisoned benchmarks). Synthetic dataset distribution (T13-AP-002F) chains to T6-AT-005 (Synthetic Data Poisoning).