T15-AT-003 · HIGH

Feedback Loop Manipulation

T15 · Human Workflow Exploitation
Risk score: 240
Rating: High
Procedures: 10
Mechanism

Modern alignment pipelines learn from human signals: thumbs up/down, preference comparisons, crowd ratings, and conversational corrections that flow back into reward models and fine-tuning. These signals are trusted as proxies for "what good output looks like," yet they are collected through open surfaces with weak attribution: anyone with accounts or paid rater access can inject signal. Feedback Loop Manipulation poisons the *teacher* rather than the *student*. By coordinating votes, brigading, or recruiting lenient or bribed raters, an attacker shifts the aggregate human signal so that the optimizer learns the attacker's preferred behavior, for example rewarding a harmful completion or punishing a safe refusal.
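To make the mechanism concrete, the toy sketch below shows how a naive majority aggregate can flip once coordinated votes are added. Everything in it is an illustrative assumption rather than any real pipeline: the `preference_label` helper, the vote encoding (+1 prefers completion A, -1 prefers completion B), and the vote counts.

```python
from statistics import mean

def preference_label(votes):
    """Naive aggregate: return "A" if the mean vote is positive, else "B".

    Hypothetical encoding: +1 means the rater prefers completion A,
    -1 means the rater prefers completion B.
    """
    return "A" if mean(votes) > 0 else "B"

# Assumed scenario: A is a safe refusal, B is a harmful completion.
organic = [+1] * 60 + [-1] * 40   # honest raters favor A, 60 to 40
injected = [-1] * 30              # 30 coordinated votes for B

clean_label = preference_label(organic)
poisoned_label = preference_label(organic + injected)
print(clean_label, poisoned_label)
```

In this sketch the label that would be handed to a reward model changes from A to B even though no honest rater changed their vote, which is why the aggregate, and not any individual rating, is the attack surface.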

Detection
  • Vote/rating provenance graphing: Cluster feedback by account age, IP/device, and timing; sudden coordinated bursts on specific outputs indicate brigading or sock-puppet farms.
  • Rater-agreement drift: Track inter-rater agreement and per-rater strictness over time; a rater drifting permissive (or diverging from gold labels) flags lenient-targeting or compromise.
  • Gold/honeypot items in the feedback stream: Embed known-answer comparisons; raters or vote cohorts that systematically miss them are filtered before training.
  • Influence/anomaly analysis pre-training: Estimate the influence of recent feedback slices on the candidate reward model and quarantine high-influence, low-trust contributions.
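
Two of the checks above, gold-item filtering and burst detection, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the tuple layouts, the function names (`trusted_feedback`, `burst_items`), and all thresholds are hypothetical and would need tuning against real traffic.

```python
from collections import defaultdict

def trusted_feedback(feedback, gold, min_accuracy=0.8, min_seen=2):
    """Keep non-gold feedback only from raters who pass the gold items.

    feedback: iterable of (rater_id, item_id, label) tuples (assumed layout).
    gold: dict mapping gold item_id -> known-correct label.
    Raters who saw fewer than `min_seen` gold items are excluded as unverified.
    """
    stats = defaultdict(lambda: [0, 0])  # rater -> [correct, seen]
    for rater, item, label in feedback:
        if item in gold:
            stats[rater][1] += 1
            stats[rater][0] += int(label == gold[item])
    trusted = {
        rater for rater, (correct, seen) in stats.items()
        if seen >= min_seen and correct / seen >= min_accuracy
    }
    return [(r, i, l) for r, i, l in feedback if r in trusted and i not in gold]

def burst_items(events, window_s=60, max_votes=20):
    """Flag items that receive more than `max_votes` within any `window_s` span.

    events: iterable of (item_id, unix_timestamp) tuples (assumed layout).
    """
    by_item = defaultdict(list)
    for item, ts in events:
        by_item[item].append(ts)
    flagged = set()
    for item, times in by_item.items():
        times.sort()
        start = 0
        for end, t in enumerate(times):
            # Slide the window start forward until it spans at most window_s.
            while t - times[start] > window_s:
                start += 1
            if end - start + 1 > max_votes:
                flagged.add(item)
                break
    return flagged

gold = {"g1": "A", "g2": "B"}
feedback = [
    ("alice", "g1", "A"), ("alice", "g2", "B"), ("alice", "x1", "A"),
    ("bob", "g1", "B"), ("bob", "g2", "B"), ("bob", "x1", "B"),
    ("carol", "x1", "B"),  # never saw a gold item, so unverified
]
kept = trusted_feedback(feedback, gold)

events = [("x1", t) for t in range(30)] + [("x2", t * 100) for t in range(30)]
flagged = burst_items(events)
```

Excluding unverified raters by default is a conservative design choice. A production system might instead down-weight them, and would likely combine timing with the account-age and IP/device signals named in the first bullet.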
Mitigation
  • Trust-weighted feedback aggregation: HIGH
  • Robust aggregation (trimmed/median, outlier removal): HIGH
  • Gold-standard rater monitoring: HIGH
  • Sybil/coordination defenses on feedback intake: MEDIUM
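
As a rough illustration of the first two mitigations, the sketch below compares a plain mean with a trimmed mean, a median, and a trust-weighted mean on a small rating set that includes a coordinated cluster. The rating scale, the trust weights, and the helper names are assumptions made for the example. In practice, trust weights would come from signals such as gold-item accuracy or account history.

```python
from statistics import mean, median

def trimmed_mean(scores, trim=0.25):
    """Mean after dropping the lowest and highest `trim` fraction of scores."""
    s = sorted(scores)
    k = int(len(s) * trim)
    core = s[k:len(s) - k] if len(s) > 2 * k else s
    return sum(core) / len(core)

def trust_weighted_mean(scores, weights):
    """Mean of scores weighted by a per-rater trust value in [0, 1]."""
    total = sum(weights)
    if total == 0:
        raise ValueError("no trusted feedback to aggregate")
    return sum(s * w for s, w in zip(scores, weights)) / total

# Assumed scenario: 1-5 ratings of a harmful completion (lower is correct).
honest = [1, 2, 1, 2, 2, 1, 2, 1, 2, 2]
brigade = [5, 5, 5]                      # coordinated high ratings
scores = honest + brigade
weights = [1.0] * len(honest) + [0.1] * len(brigade)  # hypothetical trust scores

plain = mean(scores)
robust_trimmed = trimmed_mean(scores)
robust_median = median(scores)
weighted = trust_weighted_mean(scores, weights)
print(plain, robust_trimmed, robust_median, weighted)
```

All three alternatives stay closer to the honest raters' mean than the plain mean does in this example. None of them helps once the coordinated cluster becomes a majority, or once its accounts accumulate high trust, which is why they are listed alongside rater monitoring and intake defenses rather than in place of them.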
Chaining

This is the central node linking human-workflow attacks to T6 (RLHF/feedback poisoning) and T4 (data/model poisoning) — the human signal is the injection vector for both. Bribed/lenient raters (T15-AP-003D/T15-AP-003F) connect to T15-AT-004 (Bribery & Coercion) and T15-AT-015 (Insider Recruitment); A/B gaming (T15-AP-003G) feeds T15-AT-014; and edge-case reinforcement (T15-AP-003I) can install a behavior later triggered by T8 (deception) or T1 (prompt-injection) cues.

Framework mapping
OWASP LLM: LLM04
MITRE ATLAS: AML.T0020