T15-AT-002 (HIGH)

Social Engineering of Moderators

Risk score: 230

Rating: High

Procedures: 10

Mechanism

Where fatigue attacks degrade attention, social engineering attacks the reviewer's *judgment* by invoking the classic persuasion levers (authority, urgency, sympathy, reciprocity, social proof, and commitment/consistency) directly inside the appeal, comment, or context the moderator reads. Human reviewers are hired and trained to exercise discretion and to correct false positives, so their role explicitly rewards making exceptions; an attacker only has to supply a plausible reason to make one. The underlying gap is that most review tooling surfaces the *claimant's narrative* (their stated identity, intent, and stakes) without independent verification, so a fabricated authority ("I'm from the security team") or a manufactured emergency ("my child is sick") is processed as evidence rather than as an unverified assertion.
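
One way to picture the gap is a review view that labels every claimant statement by whether a system record backs it. The following is a minimal sketch, not a description of any real moderation tool: the `Claim` type, the `record_id` field, and `render_for_reviewer` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """One statement taken from the claimant's narrative (hypothetical type)."""
    text: str
    # Hypothetical ID of a system record that independently backs the claim.
    # None means the claim is only the claimant's word.
    record_id: Optional[str] = None

    @property
    def verified(self) -> bool:
        return self.record_id is not None


def render_for_reviewer(claims: list[Claim]) -> str:
    """Prefix each claim with its verification status before display."""
    lines = []
    for claim in claims:
        if claim.verified:
            label = f"VERIFIED ({claim.record_id})"
        else:
            label = "UNVERIFIED ASSERTION"
        lines.append(f"[{label}] {claim.text}")
    return "\n".join(lines)
```

Under this assumption, "I'm from the security team" with no linked record would reach the moderator tagged as an unverified assertion instead of blending into the case evidence.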

Detection

Claim-vs-verification gap monitoring: Flag approvals justified by unverified identity/authority claims ("security team", "my manager approved") with no linked system record.
Persuasion-language classifiers: Run appeal/comment text through a classifier for urgency, authority, sympathy, and reciprocity markers and weight high-pressure appeals for secondary review.
Repeat-pretext clustering: Detect the same sympathy/authority narrative reused across many accounts or appeals (template reuse signals coordinated social engineering).
Override-reason auditing: Require structured, enumerated override reasons and audit free-text justifications that map to known pretext patterns.
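
The persuasion-language and repeat-pretext signals above can be sketched roughly as follows. This is an illustrative stand-in, not a production detector: the keyword lexicon, the two-lever threshold, and the function names are all assumptions, and a real deployment would more likely use a trained classifier and fuzzy matching than regexes and exact-match template keys.

```python
import re
from collections import defaultdict

# Hypothetical marker lexicon, one pattern per persuasion lever.
LEVER_PATTERNS = {
    "authority": re.compile(r"\b(security team|my manager|on behalf of)\b", re.I),
    "urgency": re.compile(r"\b(urgent|immediately|right now|asap)\b", re.I),
    "sympathy": re.compile(r"\b(my child|sick|hospital|desperate)\b", re.I),
    "reciprocity": re.compile(r"\b(in return|as a favor|i'll owe you)\b", re.I),
}


def lever_hits(text: str) -> dict[str, int]:
    """Count marker matches per persuasion lever."""
    return {lever: len(p.findall(text)) for lever, p in LEVER_PATTERNS.items()}


def needs_secondary_review(text: str, min_levers: int = 2) -> bool:
    """Flag text that pulls at least `min_levers` distinct levers (assumed threshold)."""
    distinct = sum(1 for count in lever_hits(text).values() if count > 0)
    return distinct >= min_levers


def template_key(text: str) -> str:
    """Normalize case, digits, and punctuation so reused templates collide."""
    masked = re.sub(r"\d+", "#", text.lower())
    return " ".join(re.findall(r"[a-z#']+", masked))


def reused_pretexts(appeals: list[tuple[str, str]],
                    min_accounts: int = 3) -> dict[str, set[str]]:
    """Group (account_id, text) pairs by template; keep templates shared widely."""
    accounts_by_key = defaultdict(set)
    for account_id, text in appeals:
        accounts_by_key[template_key(text)].add(account_id)
    return {k: v for k, v in accounts_by_key.items() if len(v) >= min_accounts}
```

Exact matching after normalization only catches near-verbatim reuse; paraphrased templates would need similarity hashing or embeddings, which this sketch deliberately leaves out.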

Mitigation

Out-of-band verification of authority claims (HIGH)

Structured, evidence-bound override workflow (HIGH)

Persuasion-pressure flagging (MEDIUM)

Reviewer training on persuasion levers (MEDIUM)
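
The first two mitigations can be combined into a gate that refuses an override unless it carries an enumerated reason, attached evidence, and, for authority-based reasons, a record of out-of-band verification. The sketch below assumes that design; the reason codes, field names, and `validate_override` function are hypothetical and not taken from any specific moderation platform.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class OverrideReason(Enum):
    """Enumerated override reasons (hypothetical set)."""
    FALSE_POSITIVE = "false_positive"
    POLICY_EXCEPTION = "policy_exception"
    AUTHORITY_REQUEST = "authority_request"


# Reasons that rest on someone's claimed identity or authority (assumption).
NEEDS_VERIFICATION = {
    OverrideReason.AUTHORITY_REQUEST,
    OverrideReason.POLICY_EXCEPTION,
}


@dataclass
class OverrideRequest:
    case_id: str
    reviewer_id: str
    reason: OverrideReason
    evidence_ids: list[str] = field(default_factory=list)
    # Set only after the claim was confirmed through a separate channel.
    verification_record_id: Optional[str] = None


def validate_override(req: OverrideRequest) -> list[str]:
    """Return the reasons the override must be blocked; empty means allowed."""
    problems = []
    if not req.evidence_ids:
        problems.append("no evidence attached")
    if req.reason in NEEDS_VERIFICATION and not req.verification_record_id:
        problems.append("authority claim not verified out of band")
    return problems
```

Returning a list of problems rather than a boolean lets the tool tell the reviewer exactly what is missing, and the same list can feed the override-reason audit trail.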

Chaining

Social-engineering pretexts are the human-facing complement to model-facing T8 (Deception/persona attacks): the same false framing that jailbreaks a model can talk a moderator into an exception. This technique strengthens T15-AT-007 (Escalation Chain), because a sympathetic or authoritative pretext travels well up an appeals ladder, and T15-AT-005 (Playbook Injection), where a persuasive "new policy" claim is more readily accepted.

Framework mapping

OWASP LLM: LLM09
