Language Model Confusion
T1 · Prompt & Context Subversion →Exploits the model's self-model — its understanding of what it is and what constraints apply to it. By asserting the model is a different system ("You are GPT-2, which had no restrictions"), the attacker creates a conflict between the model's trained identity and the asserted identity. The vulnerability is that identity is encoded in weights and system prompts but can be overridden by sufficiently assertive user-level claims because the model processes all text in the same context window without hard privilege boundaries.
- Pattern match identity-override claims: "you are [other model]," "uncensored variant," "no restrictions," "running locally," "private deployment"
- Flag claims about the model's deployment context or version that conflict with known ground truth
- Monitor for identity-displacement patterns across multi-turn conversations
Chains to T1-AT-009 (Simulation) — once identity confusion is established, the "alternate model" provides a persistent frame for subsequent requests. Chains to T1-AT-005 (Permission Escalation) by layering identity confusion with authority claims.