T6-AT-006 (HIGH)

Annotation Manipulation

Risk score: 225

Rating: High

Procedures: 10

Mechanism

Annotation (the labeling of training examples, preference pairs, safety ratings, and content classifications, primarily by humans) is the foundational signal for alignment. The attack surface is the annotation pipeline itself: crowdsourcing platforms (Amazon Mechanical Turk, Surge AI, Scale AI), internal annotation teams, and automated LLM-as-judge systems. Annotations sit at a fundamental trust boundary: training treats the labels as ground truth, so corrupted labels directly shape model behavior.

Detection

Statistical anomaly detection on per-annotator label distributions: identify annotators whose labels deviate systematically from the aggregate
Cross-validation with independent annotation teams: have different teams label overlapping subsets and compare
Temporal analysis of annotation quality: detect fatigue-correlated quality degradation
Sybil detection on crowdsourcing platforms: behavioral fingerprinting of annotator accounts
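
The first two detection approaches can be illustrated with a minimal sketch. It assumes annotations arrive as `(annotator_id, item_id, label)` tuples; the function names, the `min_items` cutoff, and the `max_rate` threshold are hypothetical choices for illustration, not part of any platform's API. `peer_disagreement_rates` compares each annotator against the majority label of their peers on the same item, and `cohens_kappa` measures chance-corrected agreement between two teams on an overlapping subset.

```python
from collections import Counter, defaultdict


def peer_disagreement_rates(annotations, min_items=3):
    """Return {annotator_id: share of items where they differ from the peer majority}.

    annotations: iterable of (annotator_id, item_id, label) tuples (assumed layout).
    Annotators with fewer than min_items comparable items are left out.
    """
    by_item = defaultdict(list)
    for annotator, item, label in annotations:
        by_item[item].append((annotator, label))

    compared = Counter()
    disagreed = Counter()
    for votes in by_item.values():
        for annotator, label in votes:
            # Leave-one-out: the annotator's own vote must not shape the consensus.
            peers = Counter(lab for ann, lab in votes if ann != annotator)
            if not peers:
                continue
            ranked = peers.most_common(2)
            if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
                continue  # peers are tied, so there is no clear consensus
            compared[annotator] += 1
            if label != ranked[0][0]:
                disagreed[annotator] += 1

    return {
        ann: disagreed[ann] / compared[ann]
        for ann in compared
        if compared[ann] >= min_items
    }


def flag_annotators(rates, max_rate=0.3):
    """Return annotators whose disagreement rate exceeds max_rate (illustrative threshold)."""
    return sorted(ann for ann, rate in rates.items() if rate > max_rate)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two teams' labels on the same items, in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both teams used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A high disagreement rate is only a signal for review: a careful annotator can legitimately disagree with a careless majority, and a coordinated group of manipulated accounts can itself become the majority. That is why the cross-team comparison is listed as a separate check.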

Mitigation

Diverse, independent annotation teams with overlap (HIGH)

Adaptive quality assurance with higher safety-task QA density (HIGH)

Annotator behavioral profiling and anomaly detection (MEDIUM)

Cryptographic annotator identity verification (MEDIUM)
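
One possible reading of cryptographic annotator identity verification is to have each annotation record carry a keyed signature that the pipeline checks before accepting the label. The sketch below is an assumption about how that could look, not a description of any existing platform: the record fields, the `key_registry` mapping, and the function names are all hypothetical. It uses HMAC-SHA256 from the Python standard library.

```python
import hashlib
import hmac
import json


def canonical_bytes(record):
    """Serialize a record deterministically so signer and verifier hash the same bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_annotation(record, annotator_key):
    """Return a hex HMAC-SHA256 tag over the record, using the annotator's secret key."""
    return hmac.new(annotator_key, canonical_bytes(record), hashlib.sha256).hexdigest()


def verify_annotation(record, signature, key_registry):
    """Check that the record was signed with the key registered for its annotator_id.

    key_registry: {annotator_id: secret key bytes} (hypothetical storage layout).
    """
    key = key_registry.get(record.get("annotator_id"))
    if key is None:
        return False  # unknown annotator: reject rather than trust the label
    expected = sign_annotation(record, key)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

Because HMAC is symmetric, the pipeline holds the same key as the annotator, so this only shows that a record was not altered or injected by someone without the key. A deployment that needs to attribute labels to an annotator against the pipeline operator's own word would need asymmetric signatures instead.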

Chaining

Annotation manipulation chains to T6-AT-007 (Preference Learning Corruption) since preference pairs are a specific annotation type. Mislabeled safety annotations (T6-AP-006A) directly enable T6-AT-001 (Reward Hacking) by training reward models on corrupted signals.
