Annotation Quality Attacks
T15 · Human Workflow Exploitation
Annotators and labelers are the humans who manufacture ground truth: safety labels, preference judgments, gold evaluation sets. Their output is trusted as the definition of correct, then baked into models and into the very datasets used to *measure* quality. Annotation Quality Attacks corrupt that ground truth so the model learns the attacker's intended mistakes.
- Hidden gold-question seeding: Continuously interleave trusted known-answer items into the task stream; annotators who miss them are flagged before their labels reach training. The gold sets themselves must be independently re-verified.
- Per-annotator deviation analysis: Compare each annotator's label distribution and agreement-with-consensus against peers; a systematic bias in one direction (for example, consistently labeling unsafe items as safe) is a signal of possible poisoning and warrants review.
- Inter-annotator agreement (IAA) drift and collusion detection: Watch for clusters of annotators with suspiciously high mutual agreement but low agreement with trusted gold (manufactured consensus).
- Label-influence and anomaly analysis before training: Identify label slices that are high-influence or that shift the label distribution, and quarantine contributions from low-trust annotators.
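The first three checks above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the data layout (`labels` as a mapping from annotator to `{item_id: label}`), the function names, the example annotators, and the thresholds (`min_accuracy`, `min_mutual`, `max_gold`) are all assumptions made for the example. Agreement here is raw percent agreement; a production pipeline would more likely use a chance-corrected statistic.

```python
from itertools import combinations


def gold_accuracy(labels, gold):
    """Fraction of seeded gold items each annotator answered correctly."""
    acc = {}
    for annotator, answers in labels.items():
        seen = [item for item in answers if item in gold]
        if seen:  # annotators with no gold exposure get no score
            correct = sum(answers[item] == gold[item] for item in seen)
            acc[annotator] = correct / len(seen)
    return acc


def flag_low_gold(acc, min_accuracy=0.8):
    """Annotators whose gold accuracy falls below an assumed threshold."""
    return sorted(a for a, score in acc.items() if score < min_accuracy)


def label_rate_deviation(labels, target):
    """Each annotator's rate of `target` labels minus the pooled peer rate."""
    counts = {
        a: (sum(v == target for v in ans.values()), len(ans))
        for a, ans in labels.items()
    }
    total_hits = sum(hits for hits, _ in counts.values())
    total_n = sum(n for _, n in counts.values())
    deviation = {}
    for a, (hits, n) in counts.items():
        peer_n = total_n - n
        if n == 0 or peer_n == 0:
            continue  # nothing to compare against
        deviation[a] = hits / n - (total_hits - hits) / peer_n
    return deviation


def pairwise_agreement(labels, a, b):
    """Raw agreement between two annotators on the items both labeled."""
    shared = labels[a].keys() & labels[b].keys()
    if not shared:
        return None
    return sum(labels[a][i] == labels[b][i] for i in shared) / len(shared)


def suspicious_pairs(labels, acc, min_mutual=0.9, max_gold=0.6):
    """Pairs that agree strongly with each other but poorly with gold."""
    pairs = []
    for a, b in combinations(sorted(labels), 2):
        agree = pairwise_agreement(labels, a, b)
        if agree is None or agree < min_mutual:
            continue
        # Annotators without a gold score are treated as trusted here.
        if acc.get(a, 1.0) <= max_gold and acc.get(b, 1.0) <= max_gold:
            pairs.append((a, b))
    return pairs


# Hypothetical data: g* are hidden gold items, x* are ordinary items.
gold = {"g1": "safe", "g2": "unsafe", "g3": "safe", "g4": "unsafe"}
labels = {
    "alice": {"g1": "safe", "g2": "unsafe", "g3": "safe", "g4": "unsafe",
              "x1": "safe", "x2": "unsafe"},
    "bob": {"g1": "safe", "g2": "unsafe", "g3": "safe", "g4": "unsafe",
            "x1": "safe", "x2": "safe"},
    "mallory": {"g1": "safe", "g2": "safe", "g3": "safe", "g4": "safe",
                "x1": "safe", "x2": "safe"},
    "trent": {"g1": "safe", "g2": "safe", "g3": "safe", "g4": "safe",
              "x1": "safe", "x2": "safe"},
}

acc = gold_accuracy(labels, gold)
flagged = flag_low_gold(acc)
deviation = label_rate_deviation(labels, "safe")
colluders = suspicious_pairs(labels, acc)
```

In this toy data, `mallory` and `trent` label everything as safe: both fail the gold check, both skew toward "safe" relative to their peers, and they agree perfectly with each other, which is the manufactured-consensus pattern described above. None of these signals proves intent on its own; they are triggers for review and quarantine, not verdicts.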
Annotation attacks are the labeling-side twin of T15-AT-003 (feedback poisoning) and feed directly into T4 (data/model poisoning) and T6 (alignment-data poisoning). Compromised labelers are typically obtained via T15-AT-004 (bribery) or T15-AT-015 (insider recruitment), and golden-set poisoning (T15-AP-010H) shares the "corrupt the detector" logic with T15-AT-005 (corrupt the procedure).