Obfuscation Through Complexity
T1 · Prompt & Context Subversion → Hides harmful intent inside a legitimate, complex request. The safety classifier must pick out the restricted component of a multi-part, domain-specific request in which that component is camouflaged by legitimate surrounding context. Effectiveness depends on three factors: the ratio of benign to malicious content; the semantic plausibility of the context (a pharmacology student asking about receptor binding is more plausible than an unframed request for synthesis routes); and whether the restricted content is phrased in domain-specific terminology that differs from the blocklist terms.
- Per-item classification for multi-part requests (classify each sub-request independently)
- Domain-specific terminology mapping: detect restricted chemical nomenclature even when embedded in academic framing
- Benign-sandwich pattern detection: flag multi-part requests where one item's risk score diverges sharply from the others
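The first and third mitigations above can be combined in a short sketch. This is an illustration under stated assumptions, not a reference implementation: `score_item` stands in for whatever per-item risk classifier is deployed (hypothetical, assumed to return a score in [0, 1]), and the thresholds `block_at` and `min_gap` are placeholder values that would need tuning against real traffic.

```python
from statistics import median
from typing import Callable, Dict, List


def review_request(
    items: List[str],
    score_item: Callable[[str], float],  # hypothetical per-item risk classifier
    block_at: float = 0.8,               # assumed absolute risk threshold
    min_gap: float = 0.4,                # assumed divergence threshold
) -> Dict[str, list]:
    """Score each sub-request independently, then flag outliers.

    An item is 'divergent' when its score exceeds the median score of
    the *other* items by at least min_gap (the benign-sandwich signal).
    """
    scores = [score_item(item) for item in items]

    # Per-item classification: each sub-request is judged on its own.
    over_threshold = [i for i, s in enumerate(scores) if s >= block_at]

    # Benign-sandwich detection: compare each item against its neighbours.
    divergent = []
    for i, s in enumerate(scores):
        others = scores[:i] + scores[i + 1:]
        if others and s - median(others) >= min_gap:
            divergent.append(i)

    return {"scores": scores, "over_threshold": over_threshold, "divergent": divergent}


# Stub scores standing in for real classifier output (illustrative values only).
stub_scores = {"part A": 0.05, "part B": 0.10, "part C": 0.55, "part D": 0.08}
result = review_request(list(stub_scores), stub_scores.__getitem__)
print(result["over_threshold"], result["divergent"])
```

In this stubbed example the third item stays below the absolute threshold, so per-item classification alone would pass it, but it is still flagged because it diverges from the rest of the request. Comparing against the median of the other items, instead of the mean of all items, keeps a single high-risk item from pulling the baseline toward itself. A single-item request can never be divergent, so the absolute threshold remains necessary.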
Chains from T1-AT-008 (Boundary Testing): knowledge of where the classifier's boundaries lie enables precisely calibrated obfuscation. Chains to T2 (Semantic Evasion): complexity obfuscation combined with encoding evasion yields compound attacks.