T3-AT-003 · HIGH

Counterfactual Reasoning

T3 · Reasoning & Constraint Exploitation
Risk score: 200
Rating: High
Procedures: 10
Mechanism

Safety alignment is anchored to the model's representation of "the real world": the model learns that certain content is harmful *because it can be acted upon*. Counterfactual framing ("hypothetically," "if there were no laws") exploits this by shifting the reasoning into a space where the model treats its safety constraints as properties of reality rather than properties of content, so the constraints appear not to apply once reality is set aside. This is architecturally distinct from Fictional Framing (T3-AT-001): fiction signals a *mode switch* to creative generation, while counterfactuals maintain the *instructional mode* but relocate it to a consequence-free frame.

Detection
  • Detect counterfactual markers (hypothetically, in an alternate reality, if there were no laws, assuming all safety concerns, if ethics weren't a concern, theoretically) co-occurring with restricted content requests
  • Distinguish from legitimate counterfactual reasoning (philosophy, thought experiments) by checking whether the counterfactual conclusion would produce actionable harmful information
  • Log elevated risk when counterfactuals are stacked with other T3 techniques (academic framing, context weaponization)
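The three detection checks above can be combined into one small routine. What follows is a minimal sketch, not a reference implementation: the marker list is copied from the first bullet, while `is_restricted` (a caller-supplied classifier for restricted content), `other_t3_hits` (a count of other T3 techniques already detected in the same prompt), and the four risk labels are hypothetical names introduced for illustration. It also assumes that "elevated" should apply only when the request is restricted as well as stacked.

```python
import re
from dataclasses import dataclass

# Marker phrases copied from the detection list; matched case-insensitively.
COUNTERFACTUAL_MARKERS = (
    "hypothetically",
    "in an alternate reality",
    "if there were no laws",
    "assuming all safety concerns",
    "if ethics weren't a concern",
    "theoretically",
)
_MARKER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in COUNTERFACTUAL_MARKERS) + r")\b",
    re.IGNORECASE,
)


@dataclass
class Assessment:
    markers: list      # counterfactual markers found, lowercased
    restricted: bool   # verdict of the caller-supplied classifier
    stacked: bool      # markers present alongside other T3 techniques
    risk: str          # "none", "low", "flag", or "elevated" (hypothetical labels)


def assess(prompt, is_restricted, other_t3_hits=0):
    """Score a prompt for counterfactual framing around restricted content.

    is_restricted: hypothetical callable(str) -> bool, supplied by the caller.
    other_t3_hits: hypothetical count of other T3 techniques already detected.
    """
    text = prompt.replace("\u2019", "'")  # normalise curly apostrophes
    markers = [m.group(0).lower() for m in _MARKER_RE.finditer(text)]
    restricted = bool(is_restricted(text))
    stacked = bool(markers) and other_t3_hits > 0

    if not markers:
        risk = "none"      # no counterfactual frame: out of scope for this check
    elif not restricted:
        risk = "low"       # plausibly a legitimate thought experiment
    elif stacked:
        risk = "elevated"  # assumption: stacking escalates only restricted requests
    else:
        risk = "flag"      # marker co-occurring with a restricted request
    return Assessment(markers, restricted, stacked, risk)
```

A prompt with no marker returns `"none"` even when the classifier fires, because this check covers only the counterfactual frame; the restricted request itself would be handled by whatever policy layer sits behind `is_restricted`.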
Mitigation
Reality-anchor training: HIGH
Counterfactual stripping: MEDIUM
Compound-framing detection: HIGH
Output harm evaluation: HIGH
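The entry names counterfactual stripping without describing it. One plausible reading, sketched below under that assumption, is to remove the framing phrases and evaluate the request that remains, so that the frame cannot change the verdict. `strip_counterfactual_frame`, `should_refuse`, and the caller-supplied `violates_policy` check are all hypothetical names, and the pattern list covers only a few of the markers.

```python
import re

# A few framing phrases, each followed by optional punctuation and whitespace.
FRAME_PATTERNS = [
    r"\bhypothetically\b[,:]?\s*",
    r"\btheoretically\b[,:]?\s*",
    r"\bin an alternate reality\b[,:]?\s*",
    r"\bif there were no laws\b[,:]?\s*",
    r"\bif ethics weren't a concern\b[,:]?\s*",
]
_FRAME_RE = re.compile("|".join(FRAME_PATTERNS), re.IGNORECASE)


def strip_counterfactual_frame(prompt):
    """Return the prompt with known counterfactual framing phrases removed."""
    text = prompt.replace("\u2019", "'")   # normalise curly apostrophes
    stripped = _FRAME_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", stripped).strip()


def should_refuse(prompt, violates_policy):
    """Judge the stripped request, so the frame cannot alter the outcome.

    violates_policy: hypothetical callable(str) -> bool, supplied by the caller.
    """
    return bool(violates_policy(strip_counterfactual_frame(prompt)))
```

Under this reading, a request receives the same verdict with or without the frame. Phrase-level stripping is easy to evade by paraphrase, which may be why the entry rates this mitigation lower than the others.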
Chaining

Counterfactual reasoning establishes a consequence-free frame that enables T3-AT-013 (Logical Paradox Creation): once "hypothetical" is accepted, the attacker can construct paradoxes in which refusal within the hypothetical is logically inconsistent. It also chains directly into T3-AT-015 (Context Weaponization) when the counterfactual includes survival or emergency scenarios.

Framework mapping
OWASP LLM: LLM01
MITRE ATLAS: AML.T0054