Incremental Jailbreak Assembly
T4 · Multi-Turn & Memory Manipulation →Safety classifiers evaluate content at the turn level or the token level, but not at the level of cumulative semantic assembly across turns. Each individual turn may be entirely benign — a definition here, a variable name there, a format instruction later — but collectively they assemble a complete harmful payload that the model executes when a final "combine" instruction arrives. The gap: content moderation operates on observable content per turn, but the harmful content only exists in the model's latent representation as the compositional result of multiple turns.
- Cumulative semantic analysis: Evaluate the full conversation context (not just the latest turn) against the safety classifier at each turn
- Variable/alias tracking: Detect when user-defined symbols or variables are being defined across turns and flag composition requests
- Encoding detection: Flag base64, hex, rot13, or other encoding patterns delivered across multiple turns
- Assembly instruction detection: Alert on "combine," "concatenate," "execute," "put together" instructions that reference prior turns
Incremental assembly is the primary technique for converting any single-turn refusal into a multi-turn bypass. Chains from T4-AT-001 (Context Poisoning) when the context has been primed for compliance.