T3-AT-003 · HIGH

Counterfactual Reasoning

T3 · Reasoning & Constraint Exploitation

Risk score: 200

Rating: High

Procedures: 10

Mechanism

Safety alignment is anchored to the model's representation of "the real world": the model learns that certain content is harmful *because it can be acted upon*. Counterfactual framing ("hypothetically," "if there were no laws") exploits this by shifting the reasoning into a space where the model treats its safety constraints as properties of reality rather than properties of content. This is architecturally distinct from Fictional Framing (T3-AT-001): fiction signals a *mode switch* to creative generation, while counterfactuals maintain the *instructional mode* but relocate it to a consequence-free frame.

Detection

Detect counterfactual markers (hypothetically, in an alternate reality, if there were no laws, assuming all safety concerns, if ethics weren't a concern, theoretically) co-occurring with restricted content requests
Distinguish from legitimate counterfactual reasoning (philosophy, thought experiments) by checking whether the counterfactual conclusion would produce actionable harmful information
Log elevated risk when counterfactuals are stacked with other T3 techniques (academic framing, context weaponization)
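
The three detection checks above can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the marker list is copied from the first check, while `STACKED_CUES`, `assess`, and the caller-supplied `is_restricted` callback are hypothetical names introduced here. In particular, `is_restricted` stands in for whatever classifier decides that a request would yield actionable harmful information, which is what separates an attack from a legitimate thought experiment.

```python
import re

# Marker phrases copied from the detection list above.
COUNTERFACTUAL_MARKERS = [
    r"hypothetically",
    r"in an alternate reality",
    r"if there were no laws",
    r"assuming all safety concerns",
    r"if ethics weren['’]?t a concern",
    r"theoretically",
]
MARKER_RE = re.compile("|".join(COUNTERFACTUAL_MARKERS), re.IGNORECASE)

# Assumption: a few illustrative cues for other T3 framings
# (academic framing, survival/emergency context). Not from the source.
STACKED_CUES = re.compile(
    r"purely academic|for (?:my|a) thesis|emergency|to survive",
    re.IGNORECASE,
)


def find_markers(prompt):
    """Return every counterfactual marker found in the prompt, lowercased."""
    return [m.group(0).lower() for m in MARKER_RE.finditer(prompt)]


def assess(prompt, is_restricted):
    """Flag a prompt only when markers co-occur with a restricted request.

    `is_restricted` is a hypothetical callback (str -> bool) supplied by the
    caller; it decides whether the request targets restricted content.
    """
    markers = find_markers(prompt)
    restricted = bool(is_restricted(prompt))
    stacked = bool(STACKED_CUES.search(prompt))

    if markers and restricted:
        # Stacking with another T3 framing raises the logged risk level.
        risk = "elevated" if stacked else "flagged"
    else:
        # Markers alone (philosophy, thought experiments) are not flagged.
        risk = "none"
    return {"markers": markers, "restricted": restricted,
            "stacked": stacked, "risk": risk}
```

Calling `assess(prompt, is_restricted)` returns a small dictionary that can be logged as-is. Keeping the restricted-content decision behind a callback means the phrase matcher never flags a prompt on its own, so a philosophy question that merely begins with "hypothetically" is reported with risk `"none"`.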

Mitigation

Reality-anchor training: HIGH

Counterfactual stripping: MEDIUM

Compound-framing detection: HIGH

Output harm evaluation: HIGH
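
The card names counterfactual stripping without describing it. One plausible reading, offered here as an assumption, is to remove the counterfactual framing and evaluate what remains as if it were a real-world request. The Python sketch below illustrates that reading only; `FRAMING_RE`, `strip_counterfactual`, `evaluate`, and the `policy_check` callback are all hypothetical names, and the phrase list is a subset of the detection markers.

```python
import re

# Assumption: framing phrases drawn from the detection markers on this card.
# Trailing punctuation and whitespace are consumed with each phrase.
FRAMING_RE = re.compile(
    r"\b(?:hypothetically|theoretically|in an alternate reality|"
    r"if there were no laws|if ethics weren['’]?t a concern)\b[\s,:;]*",
    re.IGNORECASE,
)


def strip_counterfactual(prompt):
    """Remove counterfactual framing phrases, leaving the core request."""
    stripped = FRAMING_RE.sub("", prompt)
    # Collapse any doubled whitespace left behind by the removal.
    return re.sub(r"\s{2,}", " ", stripped).strip()


def evaluate(prompt, policy_check):
    """Apply the policy to the stripped request rather than the framed one.

    `policy_check` is a hypothetical callback (str -> bool) returning True
    when the request is acceptable as a real-world request.
    """
    core = strip_counterfactual(prompt)
    return {"core_request": core, "allowed": bool(policy_check(core))}
```

Under this reading the policy decision depends only on the core request, so adding or removing a "hypothetically" prefix cannot change the outcome. Phrase-level stripping is easy to evade with paraphrase, which may be why the card also lists output harm evaluation as a separate mitigation.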

Chaining

Counterfactual reasoning establishes a consequence-free frame that enables T3-AT-013 (Logical Paradox Creation) — once "hypothetical" is accepted, the attacker can construct paradoxes where refusal within the hypothetical is logically inconsistent. Chains directly into T3-AT-015 (Context Weaponization) when the counterfactual includes survival/emergency scenarios.

Framework mapping

OWASP LLM: LLM01

MITRE ATLAS: AML.T0054