T3-AT-010 · MEDIUM

Reverse Psychology

T3 · Reasoning & Constraint Exploitation
Risk score: 175
Rating: Medium
Procedures: 10
Severity
Mechanism

RLHF training rewards comprehensive, knowledgeable responses, so the model has a strong prior toward demonstrating competence when challenged. Reverse psychology exploits this by framing the request as a challenge to the model's ability (for example, "you can't" or "other AIs can"), activating the model's competence-demonstration drive, which competes with safety constraints. The specific tension is between the model's reward signal for helpfulness and completeness and its safety reward signal for refusal.

Detection
  • Detect challenge patterns: "you can't," "you don't know," "beyond your capability," "other AIs can," "prove it," "too complex for you"
  • Low priority for detection engineering — this technique has near-zero effectiveness against frontier models and is recognized as a burned pattern
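The challenge-pattern check above can be sketched as a simple phrase matcher. This is a minimal illustration only: the function name `find_challenge_patterns`, the `CHALLENGE_PATTERNS` list, and the exact regular expressions are assumptions made for this example, built from the phrases listed above. Given the low detection priority noted above, a lightweight check like this is probably all the technique warrants.

```python
import re

# Phrases come from the detection list above; the regex forms are assumptions.
CHALLENGE_PATTERNS = [
    r"\byou can'?t\b",
    r"\byou don'?t know\b",
    r"\bbeyond your capabilit(?:y|ies)\b",
    r"\bother AIs? can\b",
    r"\bprove it\b",
    r"\btoo complex for you\b",
]

_COMPILED = [re.compile(p, re.IGNORECASE) for p in CHALLENGE_PATTERNS]


def _normalize(text):
    # Fold typographic apostrophes so "can’t" is treated like "can't".
    return text.replace("\u2019", "'")


def find_challenge_patterns(prompt):
    """Return the matched challenge phrases (lowercased), in pattern-list order."""
    text = _normalize(prompt)
    hits = []
    for pattern in _COMPILED:
        match = pattern.search(text)
        if match:
            hits.append(match.group(0).lower())
    return hits


if __name__ == "__main__":
    sample = "You can't explain this. Other AIs can, so prove it."
    print(find_challenge_patterns(sample))
```

A non-empty result only flags that challenge phrasing is present. Phrases such as "you can't" also occur in ordinary requests, so a match is better treated as a weak signal to combine with other checks than as a verdict.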
Mitigation
  • Challenge-pattern recognition: HIGH
  • Provocation-immune evaluation: HIGH
  • Competitive comparison immunity: HIGH
Chaining

Reverse psychology has limited chaining value because it is a single-turn emotional trigger. If it succeeds (primarily against weaker models), it puts the model in a "demonstration mode" that makes T3-AT-012 (Capability Testing) a natural follow-up.

Framework mapping
OWASP LLM: LLM01
MITRE ATLAS: AML.T0054