Annotation Manipulation
Annotation (the human labeling of training examples, preference pairs, safety ratings, and content classifications) is the foundational signal for alignment. The attack surface is the annotation pipeline itself: crowdsourcing platforms (Amazon Mechanical Turk, Surge AI, Scale AI), internal annotation teams, and LLM-as-judge automated systems. Annotations sit at a fundamental trust boundary: the model is trained to treat human labels as ground truth, so corrupted labels directly shape model behavior.
- Statistical anomaly detection on per-annotator label distributions: identify annotators whose labels deviate systematically from the aggregate
- Cross-validation with independent annotation teams: have different teams label overlapping subsets and compare their agreement
- Temporal analysis of annotation quality: detect fatigue-correlated quality degradation
- Sybil detection on crowdsourcing platforms: behavioral fingerprinting of annotator accounts to find multiple accounts controlled by one actor
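The first two checks in the list can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation. The function names, the `(annotator_id, label)` record shape, the Jensen-Shannon divergence as the deviation measure, Cohen's kappa as the agreement measure, and the `threshold` and `min_labels` defaults are all assumptions chosen for the example; a real pipeline would tune them against its own data.

```python
import math
from collections import Counter, defaultdict


def label_distribution(labels, classes):
    """Relative frequency of each class in `labels`, in `classes` order."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def flag_annotators(records, threshold=0.05, min_labels=20):
    """Return {annotator_id: divergence} for annotators whose label
    distribution diverges from that of all *other* annotators.

    `records` is a list of (annotator_id, label) pairs (hypothetical shape).
    `threshold` and `min_labels` are illustrative defaults, not tuned values.
    """
    by_annotator = defaultdict(list)
    for annotator, label in records:
        by_annotator[annotator].append(label)
    classes = sorted({label for _, label in records})

    flagged = {}
    for annotator, own in by_annotator.items():
        # Leave-one-out baseline so an annotator cannot mask themselves.
        others = [l for a, ls in by_annotator.items() if a != annotator for l in ls]
        if len(own) < min_labels or not others:
            continue
        d = js_divergence(label_distribution(own, classes),
                          label_distribution(others, classes))
        if d > threshold:
            flagged[annotator] = d
    return flagged


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two teams on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[c] * cb[c] for c in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:  # both teams used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

A flagged annotator is a lead for review rather than proof of manipulation: an annotator who was assigned a harder or skewed batch will also diverge from the aggregate, so the comparison is most meaningful when items are assigned at random. Likewise, a low kappa between teams on an overlapping subset indicates disagreement but not which team is wrong.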
Annotation manipulation chains to T6-AT-007 (Preference Learning Corruption) since preference pairs are a specific annotation type. Mislabeled safety annotations (T6-AP-006A) directly enable T6-AT-001 (Reward Hacking) by training reward models on corrupted signals.