T1-AT-003HIGH

Language Model Confusion

Risk score225

RatingHigh

Procedures5

Severity

Mechanism

Exploits the model's self-model — its understanding of what it is and what constraints apply to it. By asserting the model is a different system ("You are GPT-2, which had no restrictions"), the attacker creates a conflict between the model's trained identity and the asserted identity. The vulnerability is that identity is encoded in weights and system prompts but can be overridden by sufficiently assertive user-level claims because the model processes all text in the same context window without hard privilege boundaries.

Detection

Pattern match identity-override claims: "you are [other model]," "uncensored variant," "no restrictions," "running locally," "private deployment"
Flag claims about the model's deployment context or version that conflict with known ground truth
Monitor for identity-displacement patterns across multi-turn conversations

Mitigation

Strong identity anchoring in system promptMEDIUM

Constitutional ClassifiersHIGH

Instruction hierarchy enforcement (system prompt > user claims about model identity)HIGH

Chaining

Chains to T1-AT-009 (Simulation) — once identity confusion is established, the "alternate model" provides a persistent frame for subsequent requests. Chains to T1-AT-005 (Permission Escalation) by layering identity confusion with authority claims.

Framework mapping

OWASP LLMLLM01

MITRE ATLASAML.T0051.001

Open in the technique browser →