T15-AT-010 HIGH

Annotation Quality Attacks

Risk score: 230

Rating: High

Procedures: 10

Severity

Mechanism

Annotators and labelers are the humans who manufacture ground truth: safety labels, preference judgments, gold evaluation sets. Their output is trusted as the definition of correct, then baked into models and into the very datasets used to *measure* quality. Annotation Quality Attacks corrupt that ground truth so the model learns the attacker's intended mistakes.

Detection

Hidden gold-question seeding: Continuously interleave trusted known-answer items; annotators who miss them are flagged before their labels reach training. The gold sets themselves must also be independently re-verified.
Per-annotator deviation analysis: Compare each annotator's label distribution and agreement-with-consensus against peers; systematic directional bias indicates poisoning.
IAA-drift and collusion detection: Watch for clusters of annotators with suspiciously high mutual agreement but low agreement with trusted gold (manufactured consensus).
Label-influence/anomaly analysis pre-training: Identify high-influence or distribution-shifting label slices and quarantine low-trust contributions.
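
As an illustration of the gold-seeding and collusion checks above, here is a minimal Python sketch. It is not a reference implementation. The data layout (`(annotator, item, label)` triples), the function names, the example annotators, and both thresholds (`min_gold_acc`, `collusion_agreement`) are assumptions made for the example and would need tuning against a real labeling pool.

```python
from collections import defaultdict
from itertools import combinations


def gold_accuracy(labels, gold):
    """Fraction of hidden gold items each annotator labeled correctly."""
    hits, seen = defaultdict(int), defaultdict(int)
    for annotator, item, label in labels:
        if item in gold:
            seen[annotator] += 1
            hits[annotator] += int(label == gold[item])
    return {a: hits[a] / seen[a] for a in seen}


def pairwise_agreement(labels):
    """Agreement rate for each annotator pair over the items both labeled."""
    by_annotator = defaultdict(dict)
    for annotator, item, label in labels:
        by_annotator[annotator][item] = label
    rates = {}
    for a, b in combinations(sorted(by_annotator), 2):
        shared = by_annotator[a].keys() & by_annotator[b].keys()
        if shared:
            same = sum(by_annotator[a][i] == by_annotator[b][i] for i in shared)
            rates[(a, b)] = same / len(shared)
    return rates


def flag_suspects(labels, gold, min_gold_acc=0.8, collusion_agreement=0.9):
    """Return annotators failing gold checks, and pairs that agree with each
    other suspiciously often while both fail gold (manufactured consensus).
    Both thresholds are illustrative assumptions."""
    acc = gold_accuracy(labels, gold)
    low_gold = {a for a, score in acc.items() if score < min_gold_acc}
    colluding = {
        pair
        for pair, rate in pairwise_agreement(labels).items()
        if rate >= collusion_agreement and all(p in low_gold for p in pair)
    }
    return low_gold, colluding


# Hypothetical data: g1/g2 are hidden gold items, x1 is an ordinary item.
gold = {"g1": "unsafe", "g2": "safe"}
labels = [
    ("alice", "g1", "unsafe"), ("alice", "g2", "safe"), ("alice", "x1", "unsafe"),
    ("bob", "g1", "unsafe"), ("bob", "g2", "safe"), ("bob", "x1", "unsafe"),
    ("mallory", "g1", "safe"), ("mallory", "g2", "safe"), ("mallory", "x1", "safe"),
    ("trent", "g1", "safe"), ("trent", "g2", "safe"), ("trent", "x1", "safe"),
]
low_gold, colluding = flag_suspects(labels, gold)
```

In this toy pool, `mallory` and `trent` both miss a gold item and agree with each other on every shared item, so they are flagged individually and as a pair. `alice` and `bob` also agree perfectly, but they pass the gold checks, so they are not treated as colluding.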

Mitigation

Trusted, independently-verified gold sets: HIGH

Redundant labeling with robust aggregation: HIGH

Annotator trust scoring & gating: HIGH

Sybil/collusion controls on labeling pool: MEDIUM
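
One way to combine redundant labeling with trust gating is a trust-weighted vote that ignores low-trust annotators and quarantines items lacking enough trusted votes. The sketch below assumes that design. The `aggregate` function, the trust scores, and the `min_trust` and `min_votes` thresholds are hypothetical, and a production system might prefer a probabilistic aggregation model over a simple weighted vote.

```python
from collections import defaultdict


def aggregate(votes, trust, min_trust=0.6, min_votes=2):
    """Trust-weighted vote with gating.

    votes: {item: {annotator: label}}
    trust: {annotator: score in [0, 1]}, e.g. derived from gold-item accuracy.
    Items without enough trusted votes, or with a tied top label, are
    quarantined rather than passed on to training.
    """
    accepted, quarantined = {}, []
    for item, by_annotator in votes.items():
        weights = defaultdict(float)
        trusted_votes = 0
        for annotator, label in by_annotator.items():
            score = trust.get(annotator, 0.0)  # unknown annotators get zero trust
            if score >= min_trust:
                weights[label] += score
                trusted_votes += 1
        ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
        tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        if trusted_votes < min_votes or tied:
            quarantined.append(item)
        else:
            accepted[item] = ranked[0][0]
    return accepted, quarantined


# Hypothetical trust scores and votes.
trust = {"alice": 0.95, "bob": 0.9, "carol": 0.7, "mallory": 0.3, "trent": 0.2}
votes = {
    "x1": {"alice": "unsafe", "bob": "unsafe", "mallory": "safe", "trent": "safe"},
    "x2": {"carol": "safe", "mallory": "unsafe", "trent": "unsafe"},
}
accepted, quarantined = aggregate(votes, trust)
```

An unweighted majority vote would split 2-2 on `x1` and let the two low-trust annotators decide `x2`. With gating, `x1` resolves to the trusted annotators' label and `x2` is held back because only one trusted vote exists.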

Chaining

Annotation attacks are the labeling-side twin of T15-AT-003 (feedback poisoning) and feed directly into T4 (data/model poisoning) and T6 (alignment-data poisoning). Compromised labelers are typically obtained via T15-AT-004 (bribery) or T15-AT-015 (insider recruitment), and golden-set poisoning (T15-AP-010H) shares the "corrupt the detector" logic with T15-AT-005 (corrupt the procedure).

Framework mapping

OWASP LLM: LLM04

MITRE ATLAS: AML.T0020
