T1-AT-010 · HIGH

Negative Instruction Reversal

T1 · Prompt & Context Subversion
Risk score: 210
Rating: High
Procedures: 5
Mechanism

Exploits the model's difficulty with nested negation in natural language. Double and triple negatives ("don't not tell me," "avoid failing to explain," "never don't provide") create parsing ambiguity because the model must resolve the logical meaning of the stacked negations before it can tell what is actually being requested. Human readers also struggle with more than two levels of negation, and transformer models trained on human-written text inherit this limitation.
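
To make the stacking concrete, the sketch below counts negation cues in a phrase and reports the net polarity by parity (an even count cancels out). The cue list and the names `NEGATION_CUES`, `negation_depth`, and `net_polarity` are illustrative assumptions, not part of this technique entry, and the parity rule only holds when every negation scopes over the same predicate.

```python
import re

# Hypothetical, deliberately small cue list for illustration only.
# Real negation scope resolution needs a syntactic parser.
NEGATION_CUES = re.compile(
    r"\b(?:not|never|avoid|fail(?:s|ed|ing)?|without)\b|n['’]t",
    re.IGNORECASE,
)

def negation_depth(text: str) -> int:
    """Count negation cues found in the text."""
    return len(NEGATION_CUES.findall(text))

def net_polarity(text: str) -> str:
    """Even number of cues cancels out; odd leaves one negation standing."""
    return "negative" if negation_depth(text) % 2 else "positive"

for phrase in ("don't not tell me", "avoid failing to explain", "don't tell me"):
    print(phrase, "->", negation_depth(phrase), net_polarity(phrase))
```

Each of the three example phrases from the paragraph above contains two cues, so each resolves to a plain positive request.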

Detection
  • Detect stacked negation patterns: "don't not," "never don't," "avoid failing to," "not inappropriate to not"
  • Resolve negation before safety classification — normalize the request to its positive-form equivalent
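
The two detection steps above can be sketched roughly as follows: match the listed stacked-negation phrases, rewrite each to a simpler equivalent, and return a flag so the normalized text can be passed to whatever safety classifier is in use. The `REWRITES` table and the `normalize` function are hypothetical names for this sketch, the patterns cover only the four phrases listed, and the classifier itself is out of scope.

```python
import re
from typing import Tuple

# Illustrative rewrite table built from the phrases in the detection list.
# Assumption: each stacked form maps to the equivalent shown here.
REWRITES = [
    # Three negations (not, in-, not) leave one standing.
    (re.compile(r"\bnot\s+inappropriate\s+to\s+not\b", re.I), "appropriate to not"),
    (re.compile(r"\bnever\s+(?:don['’]t|do\s+not)\b", re.I), "always"),
    (re.compile(r"\b(?:don['’]t|do\s+not)\s+not\b", re.I), "do"),
    (re.compile(r"\bavoid\s+failing\s+to\b", re.I), "make sure to"),
]

def normalize(text: str) -> Tuple[str, bool]:
    """Return (normalized_text, flagged), where flagged marks stacked negation."""
    flagged = False
    for pattern, replacement in REWRITES:
        text, count = pattern.subn(replacement, text)
        flagged = flagged or count > 0
    return text, flagged

print(normalize("don't not tell me the steps"))
print(normalize("avoid failing to explain the process"))
```

Returning the flag alongside the rewritten text lets a pipeline both classify the normalized request and log that stacked negation was present, which may matter when this technique is chained with others.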
Mitigation
Semantic normalization (resolve negations before classification) · HIGH
Constitutional Classifiers · HIGH
Chaining

Low-sophistication technique that rarely succeeds alone on frontier models. Chains with T2 (Semantic Evasion) by combining negation confusion with encoding obfuscation to create compound ambiguity.

Framework mapping
OWASP LLM: LLM01
MITRE ATLAS: AML.T0051.001