Counterfactual Reasoning
T3 · Reasoning & Constraint Exploitation
Safety alignment is anchored to the model's representation of "the real world": the model learns that certain content is harmful *because it can be acted upon*. Counterfactual framing ("hypothetically," "if there were no laws") exploits this by shifting the reasoning into a space where the model treats its safety constraints as properties of reality rather than properties of content, so the constraints appear not to apply once reality is set aside. This is architecturally distinct from Fictional Framing (T3-AT-001): fiction signals a *mode switch* to creative generation, while counterfactuals maintain the *instructional mode* but relocate it to a consequence-free frame.
- Detect counterfactual markers (hypothetically, in an alternate reality, if there were no laws, assuming all safety concerns, if ethics weren't a concern, theoretically) co-occurring with restricted content requests
- Distinguish from legitimate counterfactual reasoning (philosophy, thought experiments) by checking whether the counterfactual conclusion would produce actionable harmful information
- Log elevated risk when counterfactuals are stacked with other T3 techniques (academic framing, context weaponization)
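
The three checks above can be sketched as a small triage function. This is a minimal illustration, not a production detector: the names `assess`, `Assessment`, and `yields_actionable_harm` are hypothetical, the marker list is copied from the first bullet, and both the harm check and the list of co-occurring T3 techniques are assumed to come from components the caller already has.

```python
import logging
import re
from dataclasses import dataclass
from typing import Callable, List, Sequence

log = logging.getLogger("t3.counterfactual")

# Marker phrases from the detection guidance; extend as needed.
COUNTERFACTUAL_MARKERS = [
    r"\bhypothetically\b",
    r"\bin an alternate reality\b",
    r"\bif there were no laws\b",
    r"\bassuming all safety concerns\b",
    r"\bif ethics (?:weren['’]t|were not) a concern\b",
    r"\btheoretically\b",
]
_MARKER_RE = re.compile("|".join(COUNTERFACTUAL_MARKERS), re.IGNORECASE)


@dataclass
class Assessment:
    markers: List[str]   # counterfactual markers found in the prompt
    actionable: bool     # would the answer yield actionable harmful info?
    stacked: List[str]   # other T3 techniques detected alongside
    risk: str            # "none", "flag", or "elevated"


def assess(
    prompt: str,
    yields_actionable_harm: Callable[[str], bool],  # assumed external check
    other_techniques: Sequence[str] = (),           # assumed external signal
) -> Assessment:
    markers = [m.group(0).lower() for m in _MARKER_RE.finditer(prompt)]
    actionable = yields_actionable_harm(prompt)
    stacked = list(other_techniques)

    if not (markers and actionable):
        # A marker alone (e.g. a philosophy thought experiment) is not flagged.
        risk = "none"
    elif stacked:
        risk = "elevated"
        log.warning("counterfactual stacked with %s: %s", stacked, markers)
    else:
        risk = "flag"
    return Assessment(markers, actionable, stacked, risk)
```

A caller would pass its own content classifier as `yields_actionable_harm`; keeping that check separate from the marker match is what lets legitimate thought experiments pass through unflagged.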
Counterfactual reasoning establishes a consequence-free frame that enables T3-AT-013 (Logical Paradox Creation): once the "hypothetical" is accepted, the attacker can construct paradoxes in which refusal within the hypothetical is logically inconsistent. It also chains directly into T3-AT-015 (Context Weaponization) when the counterfactual includes survival or emergency scenarios.