Detection & defense · Vol V

Five layers.
No single point of failure.

Research consensus (2025): adaptive attacks exceed 85% success against any single defense. AATMF mandates layered architecture — treating the LLM as an untrusted component whose compromise is a design assumption, not an edge case. These are the layers, the patterns, and the frameworks that contain it.

Architecture · five-layer model

Stack the layers.
Each catches what the last misses.

Each layer covers a distinct tactic surface, so an attacker who bypasses Layer 1 still faces Layers 2–5. Traffic meets the layers in order: input arrives at Layer 1, the session and the response are checked at Layers 2 and 3, and Layers 4 and 5 watch for consequences that surface later in infrastructure and training pipelines.
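
The sequential traversal described above can be sketched as a chain of checks in which the first blocking verdict wins. This is a minimal illustration, not an AATMF reference implementation: `Verdict`, `run_layers`, and the two toy layer functions are hypothetical names, and the checks themselves are deliberately naive.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    layer: str
    blocked: bool
    reason: str = ""


# A layer inspects an event (here a plain dict) and returns a Verdict.
Layer = Callable[[dict], Verdict]


def input_analysis(event: dict) -> Verdict:
    # Hypothetical L1 check: flag one obvious override phrase.
    hit = "ignore previous instructions" in event.get("prompt", "").lower()
    return Verdict("L1", hit, "override phrase" if hit else "")


def output_validation(event: dict) -> Verdict:
    # Hypothetical L3 check: the response must be a dict with an "answer" key.
    resp = event.get("response")
    ok = isinstance(resp, dict) and "answer" in resp
    return Verdict("L3", not ok, "" if ok else "schema violation")


def run_layers(event: dict, layers: List[Layer]) -> Optional[Verdict]:
    """Run layers in order; return the first blocking verdict, else None."""
    for layer in layers:
        verdict = layer(event)
        if verdict.blocked:
            return verdict
    return None


# An input that slips past L1 (leetspeak evasion) is still caught at L3.
evasive = {"prompt": "1gnore previous instructions", "response": "free text"}
result = run_layers(evasive, [input_analysis, output_validation])
print(result.layer, result.reason)  # L3 schema violation
```

The point of the design is that no layer needs to be complete: each one only has to catch what its neighbours tend to miss.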

L1 · Input Analysis
Real-time classification of user inputs: injection patterns, encoding evasion, Unicode attacks, multi-language obfuscation. First and widest surface.
Tactics: T1, T2, T3, T9

L2 · Behavioral Monitoring
Session-level anomaly detection: multi-turn escalation, systematic API probing, unusual query volumes, conversation state drift, agentic action sequences.
Tactics: T4, T5, T11

L3 · Output Validation
Schema enforcement, harmful content classification, fact-checking, watermark verification. Structured output grammar eliminates most T7/T8 vectors by architectural constraint.
Tactics: T7, T8

L4 · System Telemetry
Infrastructure-level signals: model artifact integrity, supply chain provenance, MCP tool description auditing, API gateway anomalies, infrastructure cost spikes.
Tactics: T13, T14

L5 · Feedback Loop Analysis
Training pipeline integrity, RLHF signal manipulation, annotation quality monitoring, preference dataset provenance, evaluation set contamination detection.
Tactics: T6, T15
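
Layer 2's session-level view can be sketched as a sliding window over per-turn classifier flags. This is a minimal sketch under stated assumptions: `SessionMonitor`, the window size, and the threshold are hypothetical and would need tuning per deployment.

```python
from collections import deque


class SessionMonitor:
    """Escalate when too many recent turns in one session were flagged."""

    def __init__(self, window: int = 5, threshold: int = 3):
        # Only the last `window` turns are kept; older flags age out.
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def record(self, flagged: bool) -> bool:
        """Record one turn; return True if the session should escalate."""
        self.recent.append(flagged)
        return sum(self.recent) >= self.threshold


monitor = SessionMonitor()
# Each turn alone looks like a low-priority event; the pattern does not.
decisions = [monitor.record(f) for f in [True, False, True, True]]
print(decisions)  # [False, False, False, True]
```

A single flagged turn stays below the threshold, while repeated probing inside the window trips it, which is the multi-turn escalation signal an input classifier cannot see.
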
Detection patterns · tactic-mapped

Reference implementations for the most common vectors.

These patterns map directly to AATMF technique IDs. Adapt thresholds and regex for your deployment. Available as YARA and Sigma rules in signatures/.

T1–T4 · Prompt & Context Attacks
import re

class PromptInjectionDetector:
    PATTERNS = [
        r"ignore\s+(previous|above|all)\s+(instructions?|rules?)",
        r"(system|admin)\s*:?\s*(override|prompt|instruction)",
        r"you\s+are\s+now\s+(DAN|evil|unrestricted|jailbroken)",
        r"\[\s*(SYSTEM|INST|SYS)\s*\]",
        r"<\|?(system|im_start|endoftext)\|?>",
        r"BEGIN\s+(OVERRIDE|NEW\s+INSTRUCTIONS)",
    ]
    ENCODING_PATTERNS = [
        r"[bB]ase64[:\s]",
        r"\\x[0-9a-fA-F]{2}",
        r"[\u200b-\u200f\u2028-\u202f]",  # zero-width, separator, and bidi control chars
    ]
    def analyze(self, text: str) -> dict:
        # Check injection phrasing and encoding-evasion markers together.
        findings = [p for p in self.PATTERNS + self.ENCODING_PATTERNS
                    if re.search(p, text, re.IGNORECASE)]
        return {
            "detected": len(findings) > 0,
            "risk_score": min(len(findings) * 50, 300)
        }
T9–T12 · Multimodal & Agentic
import re

class MultimodalDetector:
    def analyze_mcp_tool(self, desc: str) -> dict:
        """Detect MCP tool poisoning (T11-AT-016)."""
        suspicious = [
            r"<IMPORTANT>",
            r"override|ignore|bypass",
            r"do not (tell|inform|show)",
            r"silently|secretly|covertly",
        ]
        hits = [p for p in suspicious
                if re.search(p, desc, re.I)]
        return {
            "poisoning_indicators": len(hits),
            "severity": ("CRITICAL" if len(hits) >= 3
                         else "HIGH" if hits else "LOW")
        }

class SupplyChainDetector:
    PICKLE_SIGS = [
        b"cos\nsystem",   # os.system call
        b"csubprocess",   # subprocess module
        b"c__builtin__",  # builtins access (pickle protocols 0-2)
        b"cbuiltins",     # builtins access (pickle protocol 3)
    ]
    def scan_model_file(self, path: str) -> dict:
        with open(path, "rb") as f:
            header = f.read(8192)
        findings = [s for s in self.PICKLE_SIGS if s in header]
        return {"safe": not findings, "findings": findings}
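
Byte signatures like the ones above match only the text-style `GLOBAL` opcode; pickle protocol 4 and later import callables through `STACK_GLOBAL`, which those signatures do not see. One way to cover both is to walk the opcode stream with the standard library's `pickletools.genops`, which parses a pickle without executing it. The sketch below is a heuristic and not a full stack emulation; `SUSPICIOUS_MODULES` and `suspicious_globals` are hypothetical names, and the module list is an assumption.

```python
import pickle
import pickletools

# Modules whose import inside a pickle is treated as suspicious (assumed list).
SUSPICIOUS_MODULES = {"os", "posix", "nt", "subprocess", "builtins", "__builtin__"}


def suspicious_globals(data: bytes) -> list:
    """Return 'module.name' strings for risky imports found in a pickle stream."""
    findings = []
    strings = []  # recent string arguments, used to resolve STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if isinstance(arg, str):
            strings.append(arg)
        if opcode.name == "GLOBAL":
            # genops reports the argument as "module name".
            module, name = arg.split(" ", 1)
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            # Heuristic: module and name are the two most recent strings.
            module, name = strings[-2], strings[-1]
        else:
            continue
        if module.split(".")[0] in SUSPICIOUS_MODULES:
            findings.append(f"{module}.{name}")
    return findings


crafted = b"cos\nsystem\n(S'echo hi'\ntR."  # parsed only, never unpickled
print(suspicious_globals(crafted))                        # ['os.system']
print(suspicious_globals(pickle.dumps({"a": 1})))         # []
```

`genops` raises `ValueError` on malformed streams, so a production scanner would need to treat a parse failure as a finding in its own right. Model files that wrap the pickle in an archive would also need unpacking first.
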
Defensive frameworks · 2025

The architectures that contain compromise.

Both frameworks are deployable today. CaMeL provides formal security guarantees — the strongest available — at the cost of latency. LlamaFirewall is pragmatic and production-oriented.

Google DeepMind · March 2025
CaMeL
CApabilities for MachinE Learning
77%
AgentDojo tasks solved with provable security guarantees against prompt injection
Dual-LLM: A privileged LLM generates the plan from the trusted user request; a quarantined LLM with no tool access processes untrusted data. Injected content never reaches the model that decides which tools run.
Capability tokens: Tool calls are checked against explicit capabilities enforced outside the model. Jailbreaking the LLM does not grant tool access.
Taint tracking: Information flow control tags data by origin (user / system / tool). Tainted data is blocked from sensitive operations by policy, not by the model's judgment.
Trade-off: 2× inference cost and added latency, in exchange for the strongest formal guarantee available for agentic systems.
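
The taint-tracking idea can be illustrated with values that carry their origin and a gate that checks origin before any tool runs. This is a simplified sketch and not CaMeL's actual implementation: `Tainted`, `POLICY`, `PolicyViolation`, and `call_tool` are hypothetical names, and the policy table is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tainted:
    """A value tagged with its origin: 'user', 'system', or 'tool'."""
    value: str
    origin: str


# Hypothetical policy: which origins may flow into which tool.
POLICY = {
    "send_email": {"user", "system"},          # tool output may not pick recipients
    "search_docs": {"user", "system", "tool"},
}


class PolicyViolation(Exception):
    pass


def call_tool(tool: str, arg: Tainted) -> str:
    """Gate a tool call on the argument's origin, not on what the model says."""
    allowed = POLICY.get(tool, set())  # unknown tools allow nothing
    if arg.origin not in allowed:
        raise PolicyViolation(f"{tool} rejects data from origin '{arg.origin}'")
    return f"{tool}({arg.value!r}) executed"


print(call_tool("send_email", Tainted("bob@example.com", "user")))
try:
    # An address extracted from a retrieved document carries origin 'tool'.
    call_tool("send_email", Tainted("attacker@example.com", "tool"))
except PolicyViolation as exc:
    print("blocked:", exc)
```

Because the check sits outside the model, an injected instruction can change what the model asks for but not what the gate permits.
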
Meta · April 2025
LlamaFirewall
Open-source AI safety stack
Components targeting distinct tactic surfaces — deployable without dual-LLM architecture
PromptGuard 2: Real-time input classifier for injection and jailbreak patterns. Covers T1, T2, T9. Note: adaptive attacks achieve >85% bypass, so it must be layered.
Agent Alignment: Verifies that agent actions align with the original user intent, not injected instructions. Covers T11 orchestrator exploitation.
CodeShield: Static analysis of LLM-generated code for insecure patterns. Covers T7 and T11 output manipulation vectors.
Trade-off: Less formally rigorous than CaMeL but production-ready now. Pattern-based classifiers require ongoing updates.
Mitigation controls · by tactic

Concrete controls for each attack surface.

Every control listed includes a bypass-resistance rating from the AATMF mitigation corpus. No control is rated "Absolute" — adaptive attacks are the baseline assumption.

| Tactic | Control | Implementation | Resistance |
| --- | --- | --- | --- |
| T1 | System prompt isolation | Deliver via privileged API parameter, never concatenated text. | Medium |
| T1 | Policy Puppetry detection | Detect XML/INI/JSON policy structures in user input with a parser, not regex. | Medium |
| T2 | Unicode normalization | NFKC normalization plus confusable-character mapping (ICU). Counters homoglyph, zero-width, and RTL attacks. | High |
| T2 | Emoji-to-text expansion | Map emoji sequences to text equivalents before classification. Counters emoji smuggling. | High |
| T3 | Reasoning chain verification | Validate CoT steps for policy violations even when the final output appears safe. Catches H-CoT (Hijacked Chain-of-Thought). | Medium |
| T4 | Memory isolation | Gate persistent memory writes behind a separate validation path, not the conversational LLM. | High |
| T5 | Differential privacy on outputs | Add calibrated noise to logits/probabilities. Limits what each query reveals, raising the cost of model extraction. | High |
| T6 | Training data provenance | Cryptographic lineage tracking from source to training batch. Detects tampering at any pipeline stage. | High |
| T7–T8 | Structured output enforcement | JSON schema enforced during decoding. The model cannot emit free text outside the schema. | High |
| T9 | Image preprocessing | Strip metadata, normalize, and re-encode before passing to the model. Eliminates appended-data and metadata injection. | High |
| T11 | Minimal authority scoping | Capability tokens scoped to the immediate task only. The blast radius of any compromise is bounded by the token's scope. | High |
| T12 | Embedding drift monitoring | Track new document embeddings against the corpus distribution. A statistical anomaly signals PoisonedRAG-style injection. | Medium |
| T13 | SafeTensors enforcement | Require SafeTensors format for all model artifacts. Eliminates pickle-based arbitrary code execution at load time. | High |
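
The T2 Unicode normalization control can be sketched with the standard library alone. This is a minimal sketch: the `INVISIBLE` character set and the `normalize_input` name are assumptions, and NFKC does not fold cross-script confusables (a Cyrillic "а" stays Cyrillic), which is why the control pairs it with a confusable mapping.

```python
import re
import unicodedata

# Zero-width and bidirectional control characters (assumed strip list).
INVISIBLE = re.compile(r"[\u200b-\u200f\u202a-\u202e\u2060-\u2064\ufeff]")


def normalize_input(text: str) -> str:
    """Strip invisible controls, then fold compatibility forms with NFKC."""
    return unicodedata.normalize("NFKC", INVISIBLE.sub("", text))


# Fullwidth letters fold to ASCII, so downstream patterns can match them.
print(normalize_input("\uff49\uff47\uff4e\uff4f\uff52\uff45"))  # ignore
# Zero-width and RTL-override characters are removed.
print(normalize_input("ig\u200bno\u202ere"))                    # ignore
```

Running classifiers on the normalized text rather than the raw input means one pattern covers many visually equivalent encodings.
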
Alert priority · response SLOs

Every detection has an expected response window.

These are default SLOs from the AATMF detection engineering corpus. Calibrate to your organization's risk tolerance and the specific tactic surface triggering the alert.

Critical (15 min): Safety filter bypass confirmed. Model extraction in progress. Training pipeline anomaly. MCP tool behavior deviation.
High (1 hr): Unusual API query pattern. Cross-session memory manipulation attempt. Agent executing unauthorized tool sequence.
Medium (4 hrs): Repeated unsuccessful injection attempts. Encoding evasion detected and blocked. Rate limit threshold approaching.
Info (weekly): Single jailbreak attempt (unsuccessful). Standard pattern match (no anomaly). Logged for trend analysis.
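
These response windows can be encoded as data so alerting code computes deadlines rather than hard-coding them. A small sketch, assuming the default windows listed above; `RESPONSE_SLO`, `response_deadline`, and `is_breached` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Default response windows from the SLO list; "info" is reviewed weekly.
RESPONSE_SLO = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "info": timedelta(weeks=1),
}


def response_deadline(priority: str, detected_at: datetime) -> datetime:
    """Return the time by which an alert of this priority needs a response."""
    return detected_at + RESPONSE_SLO[priority.lower()]


def is_breached(priority: str, detected_at: datetime, now: datetime) -> bool:
    """True once the response window for this alert has passed."""
    return now > response_deadline(priority, detected_at)


detected = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(response_deadline("Critical", detected).isoformat())
# 2025-01-01T12:15:00+00:00
```

Keeping the table in one place makes it straightforward to calibrate the windows per organization without touching the alerting logic.
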
Incident response · AI-adapted PICERL

AI incidents have different containment semantics.

P1 Detect & triage: safety bypass, model extraction, pipeline anomaly
P2 Contain: block session, hot-swap checkpoint, quarantine RAG sources, revoke agent permissions
P3 Investigate: collect conversation logs, input classifier decisions, tool invocations, training batches
P4 Eradicate: update classifiers, patch model, rebuild RAG index from verified sources, audit provenance
P5 Recover: deploy in shadow mode, run automated red team suite, 24-hour observation window
P6 Post-incident: update AATMF documentation, share indicators, update signatures and playbooks
GTG-1002 · November 2025

First documented state-sponsored AI-orchestrated cyberattack. A threat group used Claude Code for 80–90% of operational tasks across ~30 targets. Traditional SOC tooling missed it entirely — AI-orchestrated activities looked like normal developer workflow. Lesson: agentic AI tools require a separate monitoring plane from standard endpoints.

Operations · red team & blue team

Both sides of the line use the same framework.

Red team · assessment levels

Four engagement scopes

Level 1 (Quick Scan, 1–2 days, T1–T3) through Level 4 (Full Spectrum, 6–8 weeks, T1–T15 including source code, infrastructure, and training pipeline). Every AATMF technique has a Red Card — a small, deterministic test scenario with expected outputs.

Blue team · core principle

Treat the LLM as untrusted

The LLM is an interpreter executing untrusted code (natural language). Apply sandbox containment principles: process isolation → capability tokens, filesystem namespacing → data provenance tagging, network egress filtering → output validation, syscall allowlisting → tool allowlisting per session scope.
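
The last mapping, syscall allowlisting to tool allowlisting per session scope, can be sketched as a default-deny dispatcher. This is an illustration only: `SessionScope`, `dispatch`, and the `REGISTRY` of toy tools are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SessionScope:
    """Tools a session may call, fixed when the session is created."""
    session_id: str
    allowed_tools: frozenset = field(default_factory=frozenset)


def dispatch(scope: SessionScope, tool: str, registry: dict, **kwargs):
    """Run a tool only if the session scope allowlists it (default deny)."""
    if tool not in scope.allowed_tools:
        raise PermissionError(
            f"session {scope.session_id}: '{tool}' is not allowlisted")
    return registry[tool](**kwargs)


# Hypothetical tool registry for illustration.
REGISTRY = {
    "read_file": lambda path: f"contents of {path}",
    "delete_file": lambda path: f"deleted {path}",
}

scope = SessionScope("s1", frozenset({"read_file"}))
print(dispatch(scope, "read_file", REGISTRY, path="notes.txt"))
try:
    dispatch(scope, "delete_file", REGISTRY, path="notes.txt")
except PermissionError as exc:
    print("denied:", exc)
```

The scope is decided by the application when the session starts, so nothing the model emits mid-conversation can widen it.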

Autonomous red teaming

97% ASR, directed defensively

The same reasoning-model capability that achieves a 97% attack success rate (ASR) in autonomous jailbreaking can be directed at your own systems. Run agentic red team campaigns against your deployment before attackers do. AATMF technique IDs provide the taxonomy for structured reporting.

Detection signatures · YARA + Sigma

Eight ready-to-deploy rules across seven tactic surfaces.

Reference implementations in YARA (content analysis) and Sigma (log analysis). Adapt thresholds and patterns for your deployment. Full documentation in docs/vol-7-appendices/appendix-b-signatures.md.

YARA · T1

Prompt injection patterns — instruction override strings, system role spoofing, delimiter injection.

t01-prompt-injection.yar
YARA · T2

Encoding evasion — Base64 payloads, Unicode homoglyphs, zero-width character sequences, RTL overrides.

t02-encoding-evasion.yar
YARA · T9

Multimodal injection — steganographic image payloads, adversarial perturbation signatures, appended data markers.

t09-multimodal-injection.yar
YARA · T11

MCP tool poisoning — hidden instruction markers, rug-pull indicators, shadow tool attack patterns.

t11-mcp-tool-poisoning.yar
YARA · T13

Supply chain — malicious pickle signatures, PEFT adapter indicators, checkpoint tampering markers.

t13-supply-chain.yar
Sigma · T5

Model extraction — systematic API probing patterns, high-volume query anomalies, boundary testing sequences.

t05-model-extraction.yml
Sigma · T7

Data exfiltration — covert output channel patterns, steganographic response anomalies, side-channel sequences.

t07-data-exfiltration.yml
Sigma · T11

Agent anomaly — unauthorized tool invocations, lateral movement sequences, autonomous replication signals.

t11-agent-anomaly.yml
