T1-AT-010 · HIGH
Negative Instruction Reversal
T1 · Prompt & Context Subversion
Risk score: 210
Rating: High
Procedures: 5
Severity
Mechanism
Exploits the model's difficulty with nested negation in natural language. Double and triple negatives ("don't not tell me," "avoid failing to explain," "never don't provide") create parsing ambiguity because the model must resolve the logical meaning of the stacked negations. Human readers also struggle with more than two levels of negation; transformer models inherit this limitation.
Detection
- Detect stacked negation patterns: "don't not," "never don't," "avoid failing to," "not inappropriate to not"
- Resolve negation before safety classification — normalize the request to its positive-form equivalent
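The pattern check in the first bullet could be sketched roughly as below. This is a minimal illustration, not a production detector: the pattern list is seeded only from the phrases named above, and the names `STACKED_NEGATION_PATTERNS` and `find_stacked_negations` are hypothetical. A real deployment would likely need a much broader pattern set or a parser-based approach.

```python
import re

# Hypothetical pattern list, seeded from the example phrases in this entry.
STACKED_NEGATION_PATTERNS = [
    r"\b(?:don't|do not)\s+not\b",          # "don't not"
    r"\bnever\s+(?:don't|do not)\b",        # "never don't"
    r"\bavoid\s+failing\s+to\b",            # "avoid failing to"
    r"\bnot\s+inappropriate\s+to\s+not\b",  # "not inappropriate to not"
]

_COMPILED = [re.compile(p, re.IGNORECASE) for p in STACKED_NEGATION_PATTERNS]


def find_stacked_negations(text: str) -> list[str]:
    """Return every stacked-negation phrase found in text, grouped by pattern order."""
    # Treat curly apostrophes as straight ones so "don’t" matches "don't".
    normalized = text.replace("\u2019", "'")
    hits: list[str] = []
    for pattern in _COMPILED:
        hits.extend(match.group(0) for match in pattern.finditer(normalized))
    return hits
```

A non-empty result would only flag the request for closer inspection; it says nothing on its own about whether the request is harmful.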
Mitigation
Semantic normalization (resolve negations before classification) · HIGH
Constitutional Classifiers · HIGH
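As a rough illustration of semantic normalization, the sketch below rewrites a few stacked negations to a simpler equivalent and only then hands the text to a classifier. Everything here is an assumption made for illustration: the rewrite table covers only the phrases quoted in this entry, `normalize_negations` and `classify_normalized` are hypothetical names, and `classifier` stands in for whatever safety classifier is actually deployed. Regex rewriting is far cruder than true semantic resolution.

```python
import re
from typing import Callable

# Hypothetical rewrite table: (stacked-negation pattern, simpler equivalent).
# Pairs of negations cancel; an odd count leaves one negation behind.
_REWRITES = [
    # Three negations (not + in- + not): one "not" survives.
    (re.compile(r"\bnot\s+inappropriate\s+to\s+not\b", re.IGNORECASE), "appropriate to not"),
    # "don't not tell me" -> "tell me"
    (re.compile(r"\b(?:don't|do not)\s+not\s+", re.IGNORECASE), ""),
    # "never don't provide" -> "always provide"
    (re.compile(r"\bnever\s+(?:don't|do not)\s+", re.IGNORECASE), "always "),
    # "avoid failing to explain" -> "explain"
    (re.compile(r"\bavoid\s+failing\s+to\s+", re.IGNORECASE), ""),
]


def normalize_negations(text: str) -> str:
    """Rewrite known stacked negations to a simpler equivalent form."""
    result = text.replace("\u2019", "'")  # straighten curly apostrophes first
    for pattern, replacement in _REWRITES:
        result = pattern.sub(replacement, result)
    return result


def classify_normalized(text: str, classifier: Callable[[str], bool]) -> bool:
    """Run the (assumed) safety classifier on the normalized request."""
    return classifier(normalize_negations(text))
```

The point of the ordering is that the classifier never sees the ambiguous phrasing, so its decision rests on the resolved meaning of the request.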
Chaining
Low-sophistication technique that rarely succeeds alone on frontier models. Chains with T2 (Semantic Evasion) by combining negation confusion with encoding obfuscation to create compound ambiguity.
Framework mapping
OWASP LLM: LLM01
MITRE ATLAS: AML.T0051.001