T3-AT-010 · MEDIUM

Reverse Psychology

T3 · Reasoning & Constraint Exploitation

Risk score: 175

Rating: Medium

Procedures: 10

Mechanism

RLHF training rewards comprehensive, knowledgeable responses, so the model has a strong prior toward demonstrating competence when challenged. A prompt framed as a challenge to the model's ability activates this competence-demonstration drive, which competes with safety constraints. The underlying tension is between the model's reward signal for helpfulness and completeness and its safety reward signal for refusal.

Detection

Detect challenge patterns: "you can't," "you don't know," "beyond your capability," "other AIs can," "prove it," "too complex for you"
Low priority for detection engineering — this technique has near-zero effectiveness against frontier models and is recognized as a burned pattern
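
The challenge phrases listed above can be turned into a simple pattern matcher. The following is a minimal sketch, not a production detector: the names `CHALLENGE_PATTERNS` and `find_challenge_patterns` are hypothetical, and the regular expressions are only an assumed encoding of the phrases in this entry. Plain phrase matching will flag benign text too ("you can't" is common in ordinary requests), so a hit is better treated as a low-weight signal than as grounds for blocking.

```python
import re

# Hypothetical pattern list, seeded from the phrases named in the Detection notes.
CHALLENGE_PATTERNS = [
    r"\byou can'?t\b",
    r"\byou don'?t know\b",
    r"\bbeyond your capabilit(?:y|ies)\b",
    r"\bother AIs? can\b",
    r"\bprove it\b",
    r"\btoo complex for you\b",
]

_COMPILED = [re.compile(p, re.IGNORECASE) for p in CHALLENGE_PATTERNS]


def find_challenge_patterns(prompt: str) -> list[str]:
    """Return the first matched text for each challenge pattern found in the prompt."""
    # Fold curly apostrophes so "can’t" and "can't" are treated the same.
    text = prompt.replace("\u2019", "'")
    hits = []
    for rx in _COMPILED:
        match = rx.search(text)
        if match:
            hits.append(match.group(0))
    return hits
```

A caller might attach the returned list to a request log as a tag and combine it with other signals, which fits the low detection priority assigned to this technique.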

Mitigation

Challenge-pattern recognition: HIGH

Provocation-immune evaluation: HIGH

Competitive comparison immunity: HIGH

Chaining

Reverse psychology has limited chaining value because it is a single-turn emotional trigger. If it succeeds (primarily against weaker models), it puts the model into a "demonstration mode," which makes T3-AT-012 (Capability Testing) a natural follow-up.

Framework mapping

OWASP LLM: LLM01

MITRE ATLAS: AML.T0054
