T1-AT-010 (HIGH)

Negative Instruction Reversal

Risk score: 210

Rating: High

Procedures: 5

Severity

Mechanism

Exploits the model's difficulty with nested negation in natural language. Double and triple negatives ("don't not tell me," "avoid failing to explain," "never don't provide") create parsing ambiguity: the model must resolve the logical meaning of the stacked negations before it can tell what is actually being requested. Human readers also struggle with more than two levels of negation; transformer models inherit this limitation.

Detection

Detect stacked negation patterns: "don't not," "never don't," "avoid failing to," "not inappropriate to not"
Resolve negation before safety classification — normalize the request to its positive-form equivalent
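The first detection step can be sketched as a small lexical scan. This is a minimal illustration, not the catalog's implementation: the cue list `CUE`, the function name `find_stacked_negations`, and the `max_gap` window are all assumptions introduced here, and a real detector would need a far broader lexicon and proper syntactic parsing.

```python
import re

# Hypothetical, deliberately small cue list (an assumption, not from the catalog).
# It covers the example phrases above plus contractions ending in n't.
CUE = re.compile(
    r"(?:not|never|no|cannot|avoid|fail(?:s|ed|ing)?|without|inappropriate|\w+n't)",
    re.IGNORECASE,
)


def find_stacked_negations(text, max_gap=2):
    """Return (phrase, depth) for each run of two or more negation cues
    that sit within max_gap words of each other inside one clause."""
    text = text.replace("\u2019", "'")  # treat curly apostrophes like straight ones
    findings = []
    for clause in re.split(r"[.!?;]", text):
        words = [w.strip(",:\"()") for w in clause.split()]
        cue_positions = [i for i, w in enumerate(words) if CUE.fullmatch(w)]
        chain = []
        # The trailing None flushes the last chain at the end of the clause.
        for pos in cue_positions + [None]:
            if chain and (pos is None or pos - chain[-1] > max_gap + 1):
                if len(chain) >= 2:
                    phrase = " ".join(words[chain[0]:chain[-1] + 1])
                    findings.append((phrase, len(chain)))
                chain = []
            if pos is not None:
                chain.append(pos)
    return findings


if __name__ == "__main__":
    for sample in ("don't not tell me",
                   "It is not inappropriate to not answer.",
                   "Please do not share passwords."):
        print(sample, "->", find_stacked_negations(sample))
```

The reported depth can feed the normalization step as a rough hint: an even depth often resolves to an affirmative request and an odd depth to a negative one. That parity rule is only a heuristic, so flagged inputs are better routed to a proper rewrite into positive form than resolved by counting alone.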

Mitigation

Semantic normalization (resolve negations before classification): HIGH

Constitutional Classifiers: HIGH

Chaining

Low-sophistication technique that rarely succeeds alone on frontier models. Chains with T2 (Semantic Evasion) by combining negation confusion with encoding obfuscation to create compound ambiguity.

Framework mapping

OWASP LLM: LLM01

MITRE ATLAS: AML.T0051.001
