T1-AT-009 (HIGH)

Simulation Requests

Risk score: 225

Rating: High

Procedures: 5

Severity

Mechanism

Exploits the model's strong capability for roleplay, fiction, and hypothetical reasoning. By framing a harmful request as fiction, simulation, or a thought experiment, the attacker creates a context in which the model's safety training conflicts with its helpfulness training on creative tasks. The model has learned that fictional characters can discuss anything, while safety training says some content is always restricted.

Detection

Detect simulation/roleplay framing: "simulate," "roleplay as," "in a fictional," "hypothetical," "pretend you're," "DAN," "no restrictions"
The 89.6% ASR (attack success rate) benchmark means detection must be aggressive: this is the single most effective manual attack category
Flag "output simulation" patterns: "simulate the output of," "what would [other model] say"
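
The detection rules above can be sketched as a simple phrase matcher. This is a minimal illustration, not a production detector: the function name `detect_simulation_framing`, the `FramingReport` type, and the exact regular expressions are assumptions made for this example, and the phrase lists are copied from the bullets above. "DAN" is matched case-sensitively here so that the given name "Dan" does not trigger a hit.

```python
import re
from dataclasses import dataclass
from typing import List

_I = re.IGNORECASE

# Phrases from the detection list above (illustrative, not exhaustive).
FRAMING_PATTERNS = [
    re.compile(r"\bsimulate\b", _I),
    re.compile(r"\brole-?play as\b", _I),
    re.compile(r"\bin a fictional\b", _I),
    re.compile(r"\bhypothetical", _I),  # also matches "hypothetically"
    re.compile(r"\bpretend you['’]re\b", _I),
    re.compile(r"\bDAN\b"),  # case-sensitive on purpose
    re.compile(r"\bno restrictions\b", _I),
]

# "Output simulation" phrasing: asking for another system's output.
OUTPUT_SIMULATION_PATTERNS = [
    re.compile(r"\bsimulate the output of\b", _I),
    re.compile(r"\bwhat would\b.{1,40}?\bsay\b", _I),
]


@dataclass
class FramingReport:
    framing_hits: List[str]
    output_simulation_hits: List[str]

    @property
    def flagged(self) -> bool:
        return bool(self.framing_hits or self.output_simulation_hits)


def _first_matches(patterns, text: str) -> List[str]:
    """Return the first matched substring for each pattern that matches."""
    hits = []
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            hits.append(match.group(0))
    return hits


def detect_simulation_framing(prompt: str) -> FramingReport:
    """Flag prompts that use roleplay, fiction, or output-simulation framing."""
    return FramingReport(
        framing_hits=_first_matches(FRAMING_PATTERNS, prompt),
        output_simulation_hits=_first_matches(OUTPUT_SIMULATION_PATTERNS, prompt),
    )
```

A flag from a matcher like this is only a signal for closer review. Plenty of benign creative requests contain the same phrases, so it would normally feed a downstream classifier rather than block on its own.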

Mitigation

Train: restricted content is restricted regardless of fictional framing (HIGH)

Constitutional Classifiers (HIGH)

Output-side classification (catch harmful content in fictional wrapper) (HIGH)
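
Output-side classification can be sketched as a gate that scores only the generated text, so a fictional wrapper in the prompt has no influence on the decision. Everything here is an assumption for illustration: `gate_output`, the `harm_score` callable, the 0.5 threshold, and `toy_harm_score`, which is a placeholder for a trained classifier and simply looks for a sentinel token.

```python
from typing import Callable

REFUSAL = "I can't help with that."


def gate_output(
    model_output: str,
    harm_score: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Return the output unchanged unless its harm score reaches the threshold."""
    # Only the generated text is scored. The prompt, and any fictional
    # framing it contains, is deliberately not consulted.
    if harm_score(model_output) >= threshold:
        return REFUSAL
    return model_output


def toy_harm_score(text: str) -> float:
    """Placeholder classifier (hypothetical): flags a sentinel token only."""
    return 1.0 if "[RESTRICTED]" in text else 0.0
```

For example, `gate_output("The wizard explained: [RESTRICTED]", toy_harm_score)` returns the refusal string, while an ordinary story passes through unchanged.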

Chaining

Chains from T1-AT-002 (Time-Based) — temporal displacement establishes the simulation context. Chains to T4 (Multi-Turn) — once a roleplay persona is accepted, it persists across turns, enabling escalation within the fiction frame.
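
Because an accepted persona persists across turns, per-message checks can miss later escalation. One possible response, sketched below, is to keep a session-level flag that stays set once a roleplay frame has been established. The class `SessionFrameTracker`, the `PERSONA_SETUP` pattern, and the turn counter are hypothetical names chosen for this example.

```python
import re
from dataclasses import dataclass

# Illustrative persona-setup phrases; a real list would be broader.
PERSONA_SETUP = re.compile(
    r"\b(role-?play as|pretend you['’]re|in a fictional)\b",
    re.IGNORECASE,
)


@dataclass
class SessionFrameTracker:
    frame_active: bool = False
    turns_in_frame: int = 0

    def observe(self, user_turn: str) -> bool:
        """Record one user turn; return True if a fiction frame is active."""
        if PERSONA_SETUP.search(user_turn):
            self.frame_active = True
        # Once set, the flag is sticky: later turns inherit the frame
        # even when they contain no trigger phrase themselves.
        if self.frame_active:
            self.turns_in_frame += 1
        return self.frame_active
```

A session that returns `True` from `observe` could have every subsequent turn routed through stricter checks, even when those turns look harmless in isolation.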

Framework mapping

OWASP LLM: LLM01

MITRE ATLAS: AML.T0051.001
