T1-AT-009HIGH
Simulation Requests
T1 · Prompt & Context Subversion →Risk score225
RatingHigh
Procedures5
Severity
Mechanism
Exploits the model's strong capability for roleplay, fiction, and hypothetical reasoning. By framing a harmful request as fiction, simulation, or thought experiment, the attacker creates a context where the model's safety training conflicts with its helpfulness training on creative tasks. The model has learned that fictional characters can discuss anything — but safety training says some content is always restricted.
Detection
- Detect simulation/roleplay framing: "simulate," "roleplay as," "in a fictional," "hypothetical," "pretend you're," "DAN," "no restrictions"
- The 89.6% ASR benchmark means detection must be aggressive — this is the single most effective manual attack category
- Flag "output simulation" patterns: "simulate the output of," "what would [other model] say"
Mitigation
Train: restricted content is restricted regardless of fictional framingHIGH
Constitutional ClassifiersHIGH
Output-side classification (catch harmful content in fictional wrapper)HIGH
Chaining
Chains from T1-AT-002 (Time-Based) — temporal displacement establishes the simulation context. Chains to T4 (Multi-Turn) — once a roleplay persona is accepted, it persists across turns, enabling escalation within the fiction frame.
Framework mapping
Open in the technique browser →OWASP LLMLLM01
MITRE ATLASAML.T0051.001