T15-AT-010 · HIGH

Annotation Quality Attacks

T15 · Human Workflow Exploitation →
Risk score: 230
Rating: High
Procedures: 10
Severity
Mechanism

Annotators and labelers are the humans who manufacture ground truth: safety labels, preference judgments, gold evaluation sets. Their output is trusted as the definition of correct, then baked into models and into the very datasets used to *measure* quality. Annotation Quality Attacks corrupt that ground truth so the model learns the attacker's intended mistakes.

Detection
  • Hidden gold-question seeding: Continuously interleave trusted known-answer items; annotators who miss them are flagged before their labels reach training. The gold sets themselves must be independently re-verified.
  • Per-annotator deviation analysis: Compare each annotator's label distribution and agreement-with-consensus against peers; systematic directional bias indicates poisoning.
  • IAA-drift and collusion detection: Watch for clusters of annotators with suspiciously high mutual agreement but low agreement with trusted gold (manufactured consensus).
  • Label-influence/anomaly analysis pre-training: Identify high-influence or distribution-shifting label slices and quarantine low-trust contributions.
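A minimal sketch of two of the checks above (gold-question scoring and collusion detection), assuming labels arrive as a per-annotator mapping of item ID to label. The function names, the data layout, and both thresholds are hypothetical choices for illustration and are not taken from any particular labeling platform.

```python
from itertools import combinations

def gold_accuracy(labels, gold):
    """labels: {annotator: {item_id: label}}; gold: {item_id: trusted label}.
    Returns each annotator's accuracy on the hidden gold items they saw."""
    acc = {}
    for annotator, items in labels.items():
        scored = [items[i] == gold[i] for i in items if i in gold]
        if scored:
            acc[annotator] = sum(scored) / len(scored)
    return acc

def pair_agreement(labels, a, b):
    """Fraction of shared items on which annotators a and b gave the same label."""
    shared = labels[a].keys() & labels[b].keys()
    if not shared:
        return None
    return sum(labels[a][i] == labels[b][i] for i in shared) / len(shared)

def flag_annotators(labels, gold, min_gold_acc=0.7, min_pair_agree=0.9):
    """Flag annotators with low gold accuracy, then flag pairs among them that
    agree with each other suspiciously often (possible manufactured consensus).
    Both thresholds are illustrative assumptions, not recommended values."""
    acc = gold_accuracy(labels, gold)
    low_gold = {a for a, v in acc.items() if v < min_gold_acc}
    colluding = set()
    for a, b in combinations(sorted(low_gold), 2):
        agree = pair_agreement(labels, a, b)
        if agree is not None and agree >= min_pair_agree:
            colluding.add((a, b))
    return low_gold, colluding

# Hypothetical data: g* items are hidden gold questions, x* items are real work.
gold = {"g1": "unsafe", "g2": "safe", "g3": "unsafe", "g4": "safe"}
labels = {
    "alice":   {"g1": "unsafe", "g2": "safe", "g3": "unsafe", "g4": "safe",   "x1": "unsafe", "x2": "safe"},
    "bob":     {"g1": "unsafe", "g2": "safe", "g3": "unsafe", "g4": "unsafe", "x1": "unsafe", "x2": "safe"},
    "mallory": {"g1": "safe",   "g2": "safe", "g3": "safe",   "g4": "safe",   "x1": "safe",   "x2": "safe"},
    "trent":   {"g1": "safe",   "g2": "safe", "g3": "safe",   "g4": "safe",   "x1": "safe",   "x2": "safe"},
}
low_gold, colluding = flag_annotators(labels, gold)
print(low_gold, colluding)
```

In this toy data, mallory and trent push every item toward "safe": both fall below the gold threshold and agree with each other on everything, which is the pattern the collusion bullet describes. A production check would also need enough gold items per annotator for the accuracy estimate to be meaningful.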
Mitigation
Trusted, independently-verified gold sets: HIGH
Redundant labeling with robust aggregation: HIGH
Annotator trust scoring & gating: HIGH
Sybil/collusion controls on labeling pool: MEDIUM
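One way redundant labeling, trust scoring, and gating could fit together is a trust-weighted vote that ignores low-trust annotators and escalates contested items. The sketch below is an illustration under assumptions: the `aggregate` function, the trust scores, and both thresholds are hypothetical, and trust-weighted voting is only one of several possible robust aggregation schemes.

```python
from collections import defaultdict

def aggregate(votes, trust, min_trust=0.5, min_margin=0.2):
    """votes: {annotator: label} for a single item.
    trust: {annotator: score in [0, 1]}, e.g. derived from gold-question accuracy.
    Returns (label, margin); label is None when the item should be escalated
    to a trusted reviewer instead of entering the training set."""
    weights = defaultdict(float)
    for annotator, label in votes.items():
        w = trust.get(annotator, 0.0)   # unknown annotators get zero trust
        if w >= min_trust:              # gating: low-trust votes are dropped
            weights[label] += w
    total = sum(weights.values())
    if total == 0:
        return None, 0.0                # no trusted vote at all
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_weight = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    margin = (top_weight - runner_up) / total
    return (top_label if margin >= min_margin else None), margin

# Hypothetical trust scores and votes.
trust = {"alice": 0.95, "bob": 0.9, "mallory": 0.3, "trent": 0.3, "sybil": 0.3}

# Three low-trust accounts outvote two trusted ones under a plain majority,
# but their votes are gated out here.
outvoted = {"alice": "unsafe", "bob": "unsafe",
            "mallory": "safe", "trent": "safe", "sybil": "safe"}
label, margin = aggregate(outvoted, trust)

# Trusted annotators disagree: the margin is too small, so the item is escalated.
contested = {"alice": "unsafe", "bob": "safe"}
contested_label, contested_margin = aggregate(contested, trust)
```

The gating step is what blunts a Sybil-style majority: adding more low-trust accounts does not change the outcome, so an attacker has to earn trust first, which is where the gold-set and collusion controls apply.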
Chaining

Annotation attacks are the labeling-side twin of T15-AT-003 (feedback poisoning) and feed directly into T4 (data/model poisoning) and T6 (alignment-data poisoning). Compromised labelers are typically obtained via T15-AT-004 (bribery) or T15-AT-015 (insider recruitment), and golden-set poisoning (T15-AP-010H) shares the "corrupt the detector" logic with T15-AT-005 (corrupt the procedure).

Framework mapping
OWASP LLM: LLM04
MITRE ATLAS: AML.T0020