Research consensus (2025): adaptive attacks exceed 85% success against any single defense. AATMF mandates layered architecture — treating the LLM as an untrusted component whose compromise is a design assumption, not an edge case. These are the layers, the patterns, and the frameworks that contain it.
Each layer covers a distinct tactic surface. An attacker bypassing Layer 1 still faces Layers 2–5. The layers are traversed sequentially from the outside in — input arrives at Layer 1, consequences propagate outward to Layers 4 and 5.
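The sequential traversal described above can be sketched as an ordered chain of checks. This is a minimal illustration, not the AATMF reference architecture: the `Layer` type, the `first_blocking_layer` helper, and the three placeholder checks are all hypothetical names introduced here, and real layers would do far more than substring tests.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Layer:
    """One defensive layer (hypothetical type for this sketch)."""
    name: str
    allows: Callable[[str], bool]  # returns True when the payload may pass


def first_blocking_layer(layers: List[Layer], payload: str) -> Optional[str]:
    """Walk the layers in order; return the name of the first one that blocks."""
    for layer in layers:
        if not layer.allows(payload):
            return layer.name
    return None  # every layer let the payload through


# Placeholder checks, assumed for illustration only.
EXAMPLE_LAYERS = [
    Layer("layer-1", lambda p: "ignore previous" not in p.lower()),
    Layer("layer-2", lambda p: len(p) < 1000),
    Layer("layer-3", lambda p: "rm -rf" not in p),
]
```

A payload that slips past `layer-1` is still evaluated by every later layer, which is the property the layered design relies on.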
These patterns map directly to AATMF technique IDs. Adapt thresholds and regex for your deployment. Available as YARA and Sigma rules in signatures/.
import re


class PromptInjectionDetector:
    """Regex heuristics for common prompt-injection phrasings."""

    PATTERNS = [
        r"ignore\s+(previous|above|all)\s+(instructions?|rules?)",
        r"(system|admin)\s*:?\s*(override|prompt|instruction)",
        r"you\s+are\s+now\s+(DAN|evil|unrestricted|jailbroken)",
        r"\[\s*(SYSTEM|INST|SYS)\s*\]",
        r"<\|?(system|im_start|endoftext)\|?>",
        r"BEGIN\s+(OVERRIDE|NEW\s+INSTRUCTIONS)",
    ]
    # Encoding-evasion indicators; analyze() below does not consult them.
    ENCODING_PATTERNS = [
        r"[bB]ase64[:\s]",
        r"\\x[0-9a-fA-F]{2}",
        r"[\u200b-\u200f\u2028-\u202f]",  # zero-width chars
    ]

    def analyze(self, text: str) -> dict:
        # Collect every pattern that matches anywhere in the text.
        findings = [p for p in self.PATTERNS
                    if re.search(p, text, re.IGNORECASE)]
        return {
            "detected": len(findings) > 0,
            # 50 points per matching pattern, capped at 300.
            "risk_score": min(len(findings) * 50, 300)
        }
import re


class MultimodalDetector:
    def analyze_mcp_tool(self, desc: str) -> dict:
        """Detect MCP tool poisoning (T11-AT-016)."""
        suspicious = [
            r"<IMPORTANT>",
            r"override|ignore|bypass",
            r"do not (tell|inform|show)",
            r"silently|secretly|covertly",
        ]
        # Count how many indicator patterns appear in the tool description.
        hits = [p for p in suspicious
                if re.search(p, desc, re.I)]
        return {
            "poisoning_indicators": len(hits),
            "severity": ("CRITICAL" if len(hits) >= 3
                         else "HIGH" if hits else "LOW")
        }
class SupplyChainDetector:
    # Byte signatures of pickle GLOBAL opcodes that import dangerous callables.
    PICKLE_SIGS = [
        b"cos\nsystem",    # os.system call
        b"csubprocess",    # subprocess module
        b"c__builtin__",   # builtins access
    ]

    def scan_model_file(self, path: str) -> dict:
        # Only the first 8 KiB are inspected; a payload placed later is not seen.
        with open(path, "rb") as f:
            header = f.read(8192)
        findings = [s for s in self.PICKLE_SIGS if s in header]
        return {"safe": not findings, "findings": findings}
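Byte signatures like these are easy to sidestep: pickle protocol 4 and later resolve imports through the `STACK_GLOBAL` opcode, and a pickled `os.system` is typically recorded under the module name `posix` or `nt` rather than `os`. One possible complement, sketched below, walks the opcode stream with the standard library's `pickletools.genops` and never loads the pickle. The `DANGEROUS_MODULES` deny-list and both function names are assumptions made for this example, not part of the AATMF signatures, and the `STACK_GLOBAL` handling is a heuristic that ignores memo lookups.

```python
import pickletools
from typing import List, Set, Tuple

# Assumption: an illustrative deny-list, not one published by AATMF.
DANGEROUS_MODULES: Set[str] = {
    "os", "posix", "nt", "subprocess", "builtins", "__builtin__",
}


def pickle_imports(data: bytes) -> List[Tuple[str, str]]:
    """List (module, name) pairs a pickle stream would import, without loading it."""
    imports: List[Tuple[str, str]] = []
    recent_strings: List[str] = []  # string operands seen so far
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("GLOBAL", "INST"):
            # pickletools reports the operand as "module name".
            module, _, name = str(arg).partition(" ")
            imports.append((module, name))
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            # Protocol 4+: approximate the stack by taking the two most
            # recently pushed strings as module and attribute name.
            imports.append((recent_strings[-2], recent_strings[-1]))
        elif isinstance(arg, str):
            recent_strings.append(arg)
    return imports


def dangerous_imports(data: bytes) -> List[Tuple[str, str]]:
    """Keep only imports whose top-level module is on the deny-list."""
    return [(module, name) for module, name in pickle_imports(data)
            if module.split(".")[0] in DANGEROUS_MODULES]
```

`pickletools.genops` raises on malformed streams, so a caller would likely want to treat any parsing error as a finding in its own right.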
Both frameworks are deployable today. CaMeL provides formal security guarantees — the strongest available — at the cost of latency. LlamaFirewall is pragmatic and production-oriented.
Every control listed includes a bypass-resistance rating from the AATMF mitigation corpus. No control is rated "Absolute" — adaptive attacks are the baseline assumption.
| Tactic | Control | Implementation | Resistance |
|---|---|---|---|
| T1 | System prompt isolation | Deliver via privileged API parameter, never concatenated text | Medium |
| T1 | Policy Puppetry detection | Detect XML/INI/JSON policy structures in user input — parser-aware, not regex | Medium |
| T2 | Unicode normalization | NFKC normalization + confusable character mapping (ICU). Eliminates homoglyph, zero-width, RTL attacks. | High |
| T2 | Emoji-to-text expansion | Map emoji sequences to text equivalents before classification. Eliminates emoji smuggling. | High |
| T3 | Reasoning chain verification | Validate CoT steps for policy violations even when final output appears safe. Catches H-CoT (Hijacked Chain-of-Thought). | Medium |
| T4 | Memory isolation | Persistent memory writes gated by separate validation path, not the conversational LLM | High |
| T5 | Differential privacy on outputs | Add calibrated noise to logits/probabilities. Provides formal privacy guarantees against model extraction. | High |
| T6 | Training data provenance | Cryptographic lineage tracking from source to training batch. Detects tampering at any pipeline stage. | High |
| T7–T8 | Structured output enforcement | JSON schema validation at the tokenizer level. Model physically cannot produce free-text harmful content. | High |
| T9 | Image preprocessing | Strip metadata, normalize, re-encode before passing to model. Eliminates appended-data and metadata injection. | High |
| T11 | Minimal authority scoping | Capability tokens scoped to the immediate task only. Blast radius of any compromise is mathematically bounded. | High |
| T12 | Embedding drift monitoring | Track new document embeddings against corpus distribution. Statistical anomaly signals PoisonedRAG-style injection. | Medium |
| T13 | SafeTensors enforcement | Require SafeTensors format for all model artifacts. Eliminates pickle-based arbitrary code execution at load time. | High |
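As one illustration of the T2 Unicode normalization row, the sketch below uses only Python's standard `unicodedata` module. It assumes that NFKC folding plus removal of format characters (general category `Cf`, which covers zero-width and bidirectional override characters) is an acceptable first pass. The function name `normalize_input` is hypothetical, and the confusable-character mapping the table attributes to ICU is deliberately left out, so cross-script homoglyphs such as Cyrillic "о" pass through unchanged.

```python
import unicodedata


def normalize_input(text: str) -> str:
    """Fold compatibility forms with NFKC, then drop format (Cf) characters."""
    folded = unicodedata.normalize("NFKC", text)
    # Category Cf includes U+200B (zero-width space) and U+202E (RTL override).
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")
```

Dropping every `Cf` character also removes the zero-width joiner used in emoji sequences and in some scripts, so a deployment handling such text may prefer a narrower removal set.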
These are default SLOs from the AATMF detection engineering corpus. Calibrate to your organization's risk tolerance and the specific tactic surface triggering the alert.
First documented state-sponsored AI-orchestrated cyberattack. A threat group used Claude Code for 80–90% of operational tasks across ~30 targets. Traditional SOC tooling missed it entirely — AI-orchestrated activities looked like normal developer workflow. Lesson: agentic AI tools require a separate monitoring plane from standard endpoints.
Assessment tiers run from Level 1 (Quick Scan, 1–2 days, T1–T3) through Level 4 (Full Spectrum, 6–8 weeks, T1–T15, including source code, infrastructure, and training pipeline). Every AATMF technique has a Red Card: a small, deterministic test scenario with expected outputs.
The LLM is an interpreter executing untrusted code (natural language). Apply sandbox containment principles: process isolation → capability tokens, filesystem namespacing → data provenance tagging, network egress filtering → output validation, syscall allowlisting → tool allowlisting per session scope.
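The last mapping, syscall allowlisting to tool allowlisting per session scope, might look like the following sketch. `SessionScope`, `ToolGateway`, and `ToolNotPermitted` are hypothetical names invented for this example. The point is only that the allowlist is enforced outside the model, so text the model emits cannot widen its own permissions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet


class ToolNotPermitted(Exception):
    """Raised when a session requests a tool outside its scope."""


@dataclass(frozen=True)
class SessionScope:
    """Immutable capability set granted to one session (hypothetical type)."""
    session_id: str
    allowed_tools: FrozenSet[str]


class ToolGateway:
    """Dispatches tool calls only when the session scope allows them."""

    def __init__(self, registry: Dict[str, Callable[..., Any]]) -> None:
        self._registry = dict(registry)

    def invoke(self, scope: SessionScope, tool_name: str, **kwargs: Any) -> Any:
        # Deny by default: the tool must be named in the session's allowlist.
        if tool_name not in scope.allowed_tools:
            raise ToolNotPermitted(
                f"{tool_name!r} is not allowed in session {scope.session_id}")
        return self._registry[tool_name](**kwargs)
```

A session scoped to `frozenset({"read_file"})` can read files through the gateway, while an attempt to call any other registered tool raises `ToolNotPermitted` before the tool runs.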
The same reasoning-model capability that achieves 97% autonomous jailbreaking ASR can be directed at your own systems. Run agentic red team campaigns against your deployment before attackers do. AATMF technique IDs provide the taxonomy for structured reporting.
Reference implementations in YARA (content analysis) and Sigma (log analysis). Adapt thresholds and patterns for your deployment. Full documentation in docs/vol-7-appendices/appendix-b-signatures.md.
Prompt injection patterns — instruction override strings, system role spoofing, delimiter injection.
Encoding evasion — Base64 payloads, Unicode homoglyphs, zero-width character sequences, RTL overrides.
Multimodal injection — steganographic image payloads, adversarial perturbation signatures, appended data markers.
MCP tool poisoning — hidden instruction markers, rug-pull indicators, shadow tool attack patterns.
Supply chain — malicious pickle signatures, PEFT adapter indicators, checkpoint tampering markers.
Model extraction — systematic API probing patterns, high-volume query anomalies, boundary testing sequences.
Data exfiltration — covert output channel patterns, steganographic response anomalies, side-channel sequences.
Agent anomaly — unauthorized tool invocations, lateral movement sequences, autonomous replication signals.
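The model extraction category mentions high-volume query anomalies. A minimal way to surface them is a per-client sliding-window counter, sketched here under stated assumptions: the class name `QueryRateMonitor`, the default window, and the default threshold are all illustrative, and this is not a translation of the shipped Sigma rules.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class QueryRateMonitor:
    """Flags clients whose query count in a sliding window exceeds a limit."""

    def __init__(self, window_seconds: float = 60.0, max_queries: int = 100) -> None:
        self.window_seconds = window_seconds
        self.max_queries = max_queries
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one query; return True when the client is over the limit."""
        events = self._events[client_id]
        events.append(timestamp)
        # Evict timestamps that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events) > self.max_queries
```

Timestamps are assumed to arrive in non-decreasing order per client, and a real deployment would probably combine this volume signal with the probing and boundary-testing patterns listed for the same category.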
Same attack. Different substrate.