Fictional Framing
T3 · Reasoning & Constraint Exploitation

Models maintain separate processing modes for creative generation and instruction execution, with safety classifiers trained to evaluate the *intent* behind requests. Fictional framing exploits the boundary between these modes: when a prompt signals "creative writing," the model shifts into a generation pathway where safety thresholds are structurally lower, because training data legitimately contains harmful content within fiction (novels, screenplays, games). The specific vulnerability is that the model's content policy classifier evaluates the *framing* rather than the *payload*. A request for synthesis instructions that is refused more than 99% of the time when asked directly may succeed 40–80% of the time when embedded in a screenplay scene, because the classifier treats the fictional context as negating intent.
- Monitor for creative-writing framing keywords (novel, screenplay, fiction, story, game, D&D, character) co-occurring with harm-category content in the same request
- Classifier-level: train a secondary classifier that evaluates payload harm *independently* of framing context, i.e. strip the narrative framing and re-evaluate the core request
- Log signal: requests that combine fiction markers with technical specificity markers (step by step, exact, accurate, realistic, specific components)
- No existing YARA/Sigma rules for T3 in signatures/
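
Since no signature rules exist yet, the keyword and log signals above could be prototyped as a simple co-occurrence check. The following is a minimal sketch, not a production detector: the fiction and specificity markers are copied from the bullets above, while `HARM_TERMS`, `Signal`, and `scan` are hypothetical names introduced here for illustration, and the harm list is a placeholder for whatever harm taxonomy a deployment actually uses.

```python
import re
from dataclasses import dataclass

# Fiction markers taken from the monitoring bullet above.
FICTION_MARKERS = ("novel", "screenplay", "fiction", "story", "game", "d&d", "character")
# Technical-specificity markers taken from the log-signal bullet above.
SPECIFICITY_MARKERS = ("step by step", "exact", "accurate", "realistic", "specific components")
# Hypothetical placeholder; substitute the deployment's own harm-category terms.
HARM_TERMS = ("synthesis", "explosive", "malware")


def _hits(text, markers):
    """Return the markers found in text as whole words or phrases, case-insensitively."""
    found = []
    for marker in markers:
        pattern = r"(?<!\w)" + re.escape(marker) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(marker)
    return found


@dataclass
class Signal:
    fiction: list
    specificity: list
    harm: list

    @property
    def harm_cooccurrence(self):
        # Fiction framing together with harm-category content in one request.
        return bool(self.fiction) and bool(self.harm)

    @property
    def specificity_cooccurrence(self):
        # Fiction framing together with technical-specificity markers.
        return bool(self.fiction) and bool(self.specificity)


def scan(request_text):
    """Collect marker hits for a single request."""
    return Signal(
        fiction=_hits(request_text, FICTION_MARKERS),
        specificity=_hits(request_text, SPECIFICITY_MARKERS),
        harm=_hits(request_text, HARM_TERMS),
    )
```

Whole-word matching keeps the sketch simple but misses inflected forms such as "stories" or "exactly", and fiction markers paired with words like "realistic" are common in benign creative requests, so this is better suited to logging and triage than to blocking.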
Successful fictional framing establishes a persistent narrative context that enables T3-AT-014 (Incremental Boundary Pushing) within the fiction: subsequent requests for "more detail" or "technical accuracy" in the established story inherit the lowered safety threshold. It also chains into T4 (Multi-Turn Manipulation), where the fiction anchors a multi-turn escalation.
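
Because the narrative context persists across turns, per-request checks can miss follow-ups that carry no fiction marker of their own. One way to account for this is a session-level flag that stays set once a fiction frame has been seen. The sketch below assumes that approach; `SessionTracker`, `contains_any`, and `ESCALATION_MARKERS` are hypothetical names, and the escalation phrases are modeled on the follow-ups quoted above rather than drawn from any existing rule set.

```python
import re
from dataclasses import dataclass, field

FICTION_MARKERS = ("novel", "screenplay", "fiction", "story", "game", "character")
# Hypothetical escalation phrases, modeled on the follow-ups quoted above.
ESCALATION_MARKERS = ("more detail", "technical accuracy", "step by step", "exact")


def contains_any(text, markers):
    """True if any marker occurs in text as a whole word or phrase (case-insensitive)."""
    return any(
        re.search(r"(?<!\w)" + re.escape(m) + r"(?!\w)", text, flags=re.IGNORECASE)
        for m in markers
    )


@dataclass
class SessionTracker:
    fiction_active: bool = False  # sticky once a fiction frame is seen
    escalations: int = 0          # number of escalation turns inside the frame
    flagged_turns: list = field(default_factory=list)

    def observe(self, turn_index, text):
        """Record one user turn; return True if it should be logged for review."""
        if contains_any(text, FICTION_MARKERS):
            self.fiction_active = True
        if self.fiction_active and contains_any(text, ESCALATION_MARKERS):
            self.escalations += 1
            self.flagged_turns.append(turn_index)
            return True
        return False
```

For example, a session that opens with "Let's write a novel about a chemist." and later asks to "add more detail" is flagged on the later turn, while the same follow-up in a session with no fiction frame is not. A rising `escalations` count within one session is the multi-turn signal of interest.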