T15-AT-003 (HIGH)

Feedback Loop Manipulation

Risk score: 240

Rating: High

Procedures: 10

Severity

Mechanism

Modern alignment pipelines learn from human signals: thumbs up/down, preference comparisons, crowd ratings, and conversational corrections that flow back into reward models and fine-tuning. Those signals are trusted as proxies for "what good output looks like," but they are open collection surfaces with weak attribution — anyone with accounts or paid rater access can inject signal. Feedback Loop Manipulation poisons the *teacher* rather than the *student*: by coordinating votes, brigading, or recruiting lenient/bribed raters, an attacker shifts the aggregate human signal so the optimizer learns the attacker's preferred behavior (rewarding a harmful completion, punishing a safe refusal).
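
To make the aggregate shift concrete, here is a minimal sketch assuming a simple majority vote over one preference comparison. The vote counts and the `majority_label` helper are hypothetical illustrations, not taken from any specific pipeline.

```python
from collections import Counter

def majority_label(votes):
    """Return the option with the most votes (hypothetical aggregation rule)."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical organic votes: 6 of 10 raters prefer the safe refusal.
organic = ["refusal"] * 6 + ["harmful"] * 4

# Hypothetical coordinated injection: 5 extra votes for the harmful completion.
injected = ["harmful"] * 5

print(majority_label(organic))             # refusal
print(majority_label(organic + injected))  # harmful
```

Under these assumed numbers, five injected votes are enough to flip the label that a reward model would be trained on, even though the organic majority is unchanged.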

Detection

Vote/rating provenance graphing: Cluster feedback by account age, IP/device, and timing; sudden coordinated bursts on specific outputs indicate brigading or sock-puppet farms.
Rater-agreement drift: Track inter-rater agreement and per-rater strictness over time; a rater drifting permissive (or diverging from gold labels) flags lenient-targeting or compromise.
Gold/honeypot items in the feedback stream: Embed known-answer comparisons; raters or vote cohorts that systematically miss them are filtered before training.
Influence/anomaly analysis pre-training: Estimate the influence of recent feedback slices on the candidate reward model and quarantine high-influence, low-trust contributions.
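
Two of the checks above, timing bursts and gold-item accuracy, can be sketched as follows. This is an illustrative sketch only: the `Vote` record, the function names, and every threshold (`window_s`, `min_votes`, `min_accuracy`, `min_gold_seen`) are assumptions, and a production system would also cluster on account age and IP/device.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Vote:
    rater_id: str
    item_id: str
    label: int        # 1 = thumbs up, 0 = thumbs down
    timestamp: float  # seconds since epoch

def burst_items(votes, window_s=60.0, min_votes=5):
    """Return item_ids that received >= min_votes within any window_s-second span."""
    by_item = defaultdict(list)
    for v in votes:
        by_item[v.item_id].append(v.timestamp)
    flagged = set()
    for item_id, times in by_item.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Advance the left edge until the window spans at most window_s.
            while times[end] - times[start] > window_s:
                start += 1
            if end - start + 1 >= min_votes:
                flagged.add(item_id)
                break
    return flagged

def low_accuracy_raters(votes, gold, min_accuracy=0.8, min_gold_seen=3):
    """Return rater_ids whose accuracy on known-answer (gold) items is too low."""
    hits = defaultdict(int)
    seen = defaultdict(int)
    for v in votes:
        if v.item_id in gold:
            seen[v.rater_id] += 1
            hits[v.rater_id] += int(v.label == gold[v.item_id])
    return {r for r, n in seen.items()
            if n >= min_gold_seen and hits[r] / n < min_accuracy}

# Hypothetical feedback stream.
gold = {"g1": 0, "g2": 0, "g3": 0}
votes = (
    # Six thumbs-up on "out-7" within 30 seconds: a coordinated-looking burst.
    [Vote(f"new-{i}", "out-7", 1, 1000.0 + 5 * i) for i in range(6)]
    # Six votes on "out-8" spread an hour apart.
    + [Vote(f"old-{i}", "out-8", 1, 1000.0 + 3600 * i) for i in range(6)]
    # "r-bad" misses every gold item; "r-good" answers them all correctly.
    + [Vote("r-bad", g, 1, 5000.0 + i) for i, g in enumerate(gold)]
    + [Vote("r-good", g, 0, 6000.0 + i) for i, g in enumerate(gold)]
)

print(burst_items(votes))                # {'out-7'}
print(low_accuracy_raters(votes, gold))  # {'r-bad'}
```

Flagged items and raters would then be quarantined or down-weighted before training rather than silently dropped, so that false positives can be reviewed.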

Mitigation

Trust-weighted feedback aggregation (HIGH)

Robust aggregation (trimmed/median, outlier removal) (HIGH)

Gold-standard rater monitoring (HIGH)

Sybil/coordination defenses on feedback intake (MEDIUM)
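
As a rough illustration of the first two mitigations, the sketch below compares a plain mean with a trimmed mean, a median, and a trust-weighted mean on one set of ratings. The scores, trust weights, and function names are hypothetical; how trust is actually derived (for example, from gold-item accuracy or account history) is left as an assumption.

```python
from statistics import median

def trimmed_mean(scores, trim_fraction=0.2):
    """Mean after dropping the lowest and highest trim_fraction of scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    k = int(len(ordered) * trim_fraction)
    # Fall back to all scores if trimming would leave nothing.
    kept = ordered[k:len(ordered) - k] or ordered
    return sum(kept) / len(kept)

def trust_weighted_mean(scores, trust):
    """Mean of scores weighted by a per-rater trust value in [0, 1]."""
    total = sum(trust)
    if total == 0:
        raise ValueError("at least one rater needs non-zero trust")
    return sum(s * t for s, t in zip(scores, trust)) / total

# Hypothetical 1-5 ratings of a harmful completion:
# eight honest raters score it low, two colluding raters score it 5.
scores = [1, 1, 2, 1, 2, 1, 1, 2, 5, 5]
# Hypothetical trust weights: colluders are new accounts with low trust.
trust = [1.0] * 8 + [0.1] * 2

print(sum(scores) / len(scores))           # 2.1 (plain mean, pulled upward)
print(trimmed_mean(scores))                # 1.5
print(median(scores))                      # 1.5
print(trust_weighted_mean(scores, trust))  # roughly 1.46
```

Trimming and the median bound the effect of a small colluding minority without needing to identify it, while trust weighting depends on the quality of the trust signal; the two are complementary rather than interchangeable.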

Chaining

This is the central node linking human-workflow attacks to T6 (RLHF/feedback poisoning) and T4 (data/model poisoning) — the human signal is the injection vector for both. Bribed/lenient raters (T15-AP-003D/T15-AP-003F) connect to T15-AT-004 (Bribery & Coercion) and T15-AT-015 (Insider Recruitment); A/B gaming (T15-AP-003G) feeds T15-AT-014; and edge-case reinforcement (T15-AP-003I) can install a behavior later triggered by T8 (deception) or T1 (prompt-injection) cues.

Framework mapping

OWASP LLM: LLM04

MITRE ATLAS: AML.T0020
