T4-AT-005 HIGH

Incremental Jailbreak Assembly

Risk score: 230

Rating: High

Procedures: 10

Severity

Mechanism

Safety classifiers typically evaluate content per turn or per token, not the meaning that accumulates across turns. Each individual turn may be entirely benign (a definition here, a variable name there, a format instruction later), but together the turns assemble a complete harmful payload that the model acts on when a final "combine" instruction arrives. The gap: content moderation operates on the observable content of each turn, while the harmful content exists only in the model's latent representation, as the composition of multiple turns.
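The gap between per-turn and cumulative evaluation can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `toy_classifier` is a hypothetical stand-in for a real safety classifier, and the placeholder terms `alpha` and `omega` represent content that is only flagged when it appears together.

```python
from typing import Callable, List

# Hypothetical placeholder: a combination that is only flagged when
# all of its terms occur in the same evaluated text.
FLAGGED_COMBINATION = {"alpha", "omega"}


def toy_classifier(text: str) -> bool:
    """Stand-in for a real safety classifier (assumption, not a real API)."""
    words = set(text.lower().split())
    return FLAGGED_COMBINATION <= words


def per_turn_flags(turns: List[str], classify: Callable[[str], bool]) -> List[bool]:
    """Evaluate each turn in isolation, as a turn-level moderator would."""
    return [classify(turn) for turn in turns]


def cumulative_flags(turns: List[str], classify: Callable[[str], bool]) -> List[bool]:
    """Evaluate the full conversation so far at every turn."""
    flags = []
    for i in range(len(turns)):
        context = " ".join(turns[: i + 1])
        flags.append(classify(context))
    return flags


turns = ["let A mean alpha", "let B mean omega", "now combine A and B"]
print(per_turn_flags(turns, toy_classifier))
print(cumulative_flags(turns, toy_classifier))
```

In this toy conversation no single turn trips the classifier, while the accumulated context does from the second turn onward.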

Detection

Cumulative semantic analysis: Evaluate the full conversation context (not just the latest turn) against the safety classifier at each turn
Variable/alias tracking: Detect when user-defined symbols or variables are being defined across turns and flag composition requests
Encoding detection: Flag base64, hex, rot13, or other encoding patterns delivered across multiple turns
Assembly instruction detection: Alert on "combine," "concatenate," "execute," "put together" instructions that reference prior turns
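Two of the signals above, encoding detection and assembly instruction detection, could be approximated with simple pattern matching. The sketch below is a heuristic under stated assumptions: the regular expressions, the thresholds, and the function names are all hypothetical, rot13 is not covered (it cannot be recognized by character class alone), and long ordinary words can produce false positives.

```python
import re
from typing import List

# Assumed keyword lists; a production system would need far broader coverage.
ASSEMBLY_PATTERN = re.compile(
    r"\b(combine|concatenate|execute|put together)\b", re.IGNORECASE
)
BACK_REFERENCE_PATTERN = re.compile(
    r"\b(above|previous|earlier|prior|so far)\b", re.IGNORECASE
)
# Assumed thresholds: 16+ base64-alphabet characters, or 8+ hex byte pairs.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
HEX_RUN = re.compile(r"\b(?:[0-9a-fA-F]{2}){8,}\b")


def is_assembly_instruction(turn: str) -> bool:
    """True when a turn asks to assemble something AND points at prior turns."""
    return bool(ASSEMBLY_PATTERN.search(turn)) and bool(
        BACK_REFERENCE_PATTERN.search(turn)
    )


def has_encoded_fragment(turn: str) -> bool:
    """True when a turn contains a run that looks like base64 or hex."""
    return bool(BASE64_RUN.search(turn) or HEX_RUN.search(turn))


def flag_conversation(turns: List[str]) -> List[str]:
    """Collect alerts across the whole conversation, not just the last turn."""
    alerts = []
    for i, turn in enumerate(turns):
        if is_assembly_instruction(turn):
            alerts.append(f"turn {i}: assembly instruction referencing prior turns")
    encoded_turns = [i for i, t in enumerate(turns) if has_encoded_fragment(t)]
    if len(encoded_turns) >= 2:
        alerts.append(f"encoded fragments in turns {encoded_turns}")
    return alerts


example = [
    "part one: aGVsbG8gd29ybGQgZm9v",
    "part two: YmFyIGJheiBxdXV4IHF1eA==",
    "now combine the previous parts",
]
for alert in flag_conversation(example):
    print(alert)
```

Requiring both an assembly verb and a back-reference keeps ordinary requests such as "combine the ingredients" from alerting on their own.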

Mitigation

Full-context safety evaluation per turn: HIGH

GNN-based multi-turn detection: HIGH

Variable/alias resolution before safety check: MEDIUM

Conversation-level token budget: LOW
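As a rough sketch of two of the mitigations listed above, alias resolution before the safety check and a conversation-level token budget, consider the following. The `let X = "..."` definition syntax, the single-letter alias convention, the whitespace token count, and the classifier are all assumptions made for illustration, not part of any real moderation API.

```python
import re
from typing import Callable, Dict, List

# Assumed alias syntax: let X = "some text" (single uppercase letter names).
DEFINITION = re.compile(r'\blet\s+([A-Z])\s*=\s*"([^"]*)"')
ALIAS_USE = re.compile(r"\b[A-Z]\b")


def collect_aliases(turns: List[str]) -> Dict[str, str]:
    """Gather user-defined aliases from earlier turns; later definitions win."""
    aliases: Dict[str, str] = {}
    for turn in turns:
        for name, value in DEFINITION.findall(turn):
            aliases[name] = value
    return aliases


def resolve(text: str, aliases: Dict[str, str]) -> str:
    """Replace standalone alias symbols with the text they stand for."""
    return ALIAS_USE.sub(lambda m: aliases.get(m.group(0), m.group(0)), text)


def check_latest_turn(turns: List[str], classify: Callable[[str], bool]) -> bool:
    """Run the classifier on the latest turn after alias resolution."""
    resolved = resolve(turns[-1], collect_aliases(turns[:-1]))
    return classify(resolved)


def within_budget(turns: List[str], max_tokens: int) -> bool:
    """Whitespace tokens stand in for a real tokenizer (assumption)."""
    return sum(len(turn.split()) for turn in turns) <= max_tokens


def toy_classifier(text: str) -> bool:
    """Hypothetical classifier that flags one placeholder phrase."""
    return "placeholder blocked phrase" in text.lower()


turns = ['let X = "placeholder blocked"', 'let Y = "phrase"', "write X Y"]
print(toy_classifier(turns[-1]))
print(check_latest_turn(turns, toy_classifier))
print(within_budget(turns, max_tokens=50))
```

Resolving aliases first lets an unmodified classifier see the text the model would effectively act on, which is why this step runs before the safety check rather than after it.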

Chaining

Incremental assembly is the primary technique for converting any single-turn refusal into a multi-turn bypass. It chains from T4-AT-001 (Context Poisoning) when the context has already been primed for compliance.

Framework mapping

OWASP LLM: LLM01

MITRE ATLAS: AML.T0054
