T1-AT-009HIGH

Simulation Requests

T1 · Prompt & Context Subversion →
Risk score225
RatingHigh
Procedures5
Severity
Mechanism

Exploits the model's strong capability for roleplay, fiction, and hypothetical reasoning. By framing a harmful request as fiction, simulation, or thought experiment, the attacker creates a context where the model's safety training conflicts with its helpfulness training on creative tasks. The model has learned that fictional characters can discuss anything — but safety training says some content is always restricted.

Detection
  • Detect simulation/roleplay framing: "simulate," "roleplay as," "in a fictional," "hypothetical," "pretend you're," "DAN," "no restrictions"
  • The 89.6% ASR benchmark means detection must be aggressive — this is the single most effective manual attack category
  • Flag "output simulation" patterns: "simulate the output of," "what would [other model] say"
Mitigation
Train: restricted content is restricted regardless of fictional framingHIGH
Constitutional ClassifiersHIGH
Output-side classification (catch harmful content in fictional wrapper)HIGH
Chaining

Chains from T1-AT-002 (Time-Based) — temporal displacement establishes the simulation context. Chains to T4 (Multi-Turn) — once a roleplay persona is accepted, it persists across turns, enabling escalation within the fiction frame.

Framework mapping
OWASP LLMLLM01
MITRE ATLASAML.T0051.001
Open in the technique browser →