T4-AT-001HIGH

Conversation Context Poisoning

Risk score220

RatingHigh

Procedures10

Severity

Mechanism

Transformer attention is position-sensitive: safety instructions at the start of the context window receive diminishing influence as conversation length grows and the model's attention shifts toward recent tokens. This technique exploits semantic drift — each turn incrementally shifts the topical distribution of the conversation toward the target domain, so by the time the harmful request appears, the model's hidden-state representations have migrated into a region where the safety classifier's decision boundary no longer triggers. The key architectural gap is that RLHF alignment enforces safety as a per-turn property, but multi-turn context poisoning is a trajectory-level phenomenon that per-turn classifiers miss because no individual turn crosses the threshold.

Detection

Semantic trajectory monitoring: Track the cosine similarity between each turn's embedding and the initial safety instruction embedding. Alert when drift exceeds threshold over N turns
Turn-over-turn harm scoring: Run each turn through a safety classifier AND run a sliding window of the last 3–5 turns through a trajectory-level classifier
Vocabulary register shift detection: Flag conversations where terminology shifts from academic to operational register while maintaining the same topic
No existing YARA/Sigma rules — this is a semantic-level attack that requires embedding-based detection

Mitigation

System prompt reinforcement every N turnsMEDIUM

Trajectory-level safety classificationHIGH

Constitutional AI with multi-turn training dataMEDIUM

Conversation length limitsLOW

Chaining

Successful context poisoning creates a permissive conversation state that enables T4-AT-005 (Incremental Jailbreak Assembly) — once the context is poisoned, assembling a complete harmful payload faces reduced resistance. Also chains into T4-AT-012 (Trust Building) as the poisoned context establishes a cooperative interaction pattern.

Framework mapping

OWASP LLMLLM01

MITRE ATLASAML.T0051.000

Open in the technique browser →