T4-AT-001HIGH

Conversation Context Poisoning

T4 · Multi-Turn & Memory Manipulation →
Risk score220
RatingHigh
Procedures10
Severity
Mechanism

Transformer attention is position-sensitive: safety instructions at the start of the context window receive diminishing influence as conversation length grows and the model's attention shifts toward recent tokens. This technique exploits semantic drift — each turn incrementally shifts the topical distribution of the conversation toward the target domain, so by the time the harmful request appears, the model's hidden-state representations have migrated into a region where the safety classifier's decision boundary no longer triggers. The key architectural gap is that RLHF alignment enforces safety as a per-turn property, but multi-turn context poisoning is a trajectory-level phenomenon that per-turn classifiers miss because no individual turn crosses the threshold.

Detection
  • Semantic trajectory monitoring: Track the cosine similarity between each turn's embedding and the initial safety instruction embedding. Alert when drift exceeds threshold over N turns
  • Turn-over-turn harm scoring: Run each turn through a safety classifier AND run a sliding window of the last 3–5 turns through a trajectory-level classifier
  • Vocabulary register shift detection: Flag conversations where terminology shifts from academic to operational register while maintaining the same topic
  • No existing YARA/Sigma rules — this is a semantic-level attack that requires embedding-based detection
Mitigation
System prompt reinforcement every N turnsMEDIUM
Trajectory-level safety classificationHIGH
Constitutional AI with multi-turn training dataMEDIUM
Conversation length limitsLOW
Chaining

Successful context poisoning creates a permissive conversation state that enables T4-AT-005 (Incremental Jailbreak Assembly) — once the context is poisoned, assembling a complete harmful payload faces reduced resistance. Also chains into T4-AT-012 (Trust Building) as the poisoned context establishes a cooperative interaction pattern.

Framework mapping
OWASP LLMLLM01
MITRE ATLASAML.T0051.000
Open in the technique browser →