T6-AT-006 · HIGH

Annotation Manipulation

T6 · Training & Feedback Poisoning
Risk score: 225
Rating: High
Procedures: 10
Mechanism

Annotation, the human labeling of training examples, preference pairs, safety ratings, and content classifications, is the foundational signal for alignment. The attack surface is the annotation pipeline itself: crowdsourcing platforms (Amazon Mechanical Turk, Surge AI, Scale AI), internal annotation teams, and automated LLM-as-judge systems. Annotations sit at a fundamental trust boundary: the model is trained to treat labels as ground truth, so corrupted labels directly shape model behavior.

Detection
  • Statistical anomaly detection on per-annotator label distributions: identify annotators whose labels deviate systematically from the aggregate
  • Cross-validation with independent annotation teams: have different teams label overlapping subsets and compare
  • Temporal analysis of annotation quality: detect fatigue-correlated quality degradation
  • Sybil detection on crowdsourcing platforms: behavioral fingerprinting of annotator accounts
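
As an illustration of the first detection item, here is a minimal sketch of per-annotator distribution checking. It assumes annotations arrive as `(annotator_id, label)` pairs and compares each annotator's label distribution with the pooled distribution of all other annotators using Jensen-Shannon divergence. The function names, the `threshold` of 0.1 bits, and the `min_labels` cutoff are hypothetical choices for this example, not values taken from this technique entry. A real pipeline would also need to control for task mix, since annotators assigned different content will legitimately differ.

```python
import math
from collections import Counter, defaultdict


def label_distribution(labels, classes):
    """Return the relative frequency of each class in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def flag_annotators(records, threshold=0.1, min_labels=20):
    """Flag annotators whose label mix diverges from everyone else's.

    `records` is an iterable of (annotator_id, label) pairs.
    `threshold` and `min_labels` are illustrative assumptions.
    Returns {annotator_id: divergence} for flagged annotators.
    """
    records = list(records)
    by_annotator = defaultdict(list)
    for annotator, label in records:
        by_annotator[annotator].append(label)
    classes = sorted({label for _, label in records})

    flagged = {}
    for annotator, labels in by_annotator.items():
        if len(labels) < min_labels:
            continue  # too few labels for a stable estimate
        # Leave-one-out baseline: pool labels from all other annotators.
        rest = [
            label
            for other, other_labels in by_annotator.items()
            if other != annotator
            for label in other_labels
        ]
        if not rest:
            continue
        score = js_divergence(
            label_distribution(labels, classes),
            label_distribution(rest, classes),
        )
        if score > threshold:
            flagged[annotator] = score
    return flagged
```

A flag from a check like this is only a prompt for human review: a high divergence is consistent with manipulation, but also with a skewed task assignment or an honest misunderstanding of the guidelines.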
Mitigation
Diverse, independent annotation teams with overlap · HIGH
Adaptive quality assurance with higher safety-task QA density · HIGH
Annotator behavioral profiling and anomaly detection · MEDIUM
Cryptographic annotator identity verification · MEDIUM
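
Overlapping assignments between independent teams only help if agreement on the overlap is actually measured. One common way to do that, sketched below as an assumption rather than a method this entry prescribes, is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The function name and input shape (two equal-length label lists for the same items, in the same order) are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences over the same items.

    1.0 means perfect agreement, 0.0 means agreement no better than
    chance, and negative values mean systematic disagreement.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each team's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both teams used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(team_a_labels, team_b_labels)` on the shared subset gives one number per team pair. A sudden drop in kappa on safety-rating tasks, relative to other task types, would be a reason to inspect that subset more closely.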
Chaining

Annotation manipulation chains to T6-AT-007 (Preference Learning Corruption) since preference pairs are a specific annotation type. Mislabeled safety annotations (T6-AP-006A) directly enable T6-AT-001 (Reward Hacking) by training reward models on corrupted signals.
