Detect & Defend

Architecture · five-layer model

Stack the layers.
Each catches what the last misses.

Each layer covers a distinct tactic surface. An attacker bypassing Layer 1 still faces Layers 2–5. The layers are traversed sequentially from the outside in — input arrives at Layer 1, consequences propagate outward to Layers 4 and 5.

L1

Input Analysis

Real-time classification of user inputs — injection patterns, encoding evasion, Unicode attacks, multi-language obfuscation. First and widest surface.

T1 T2 T3 T9

L2

Behavioral Monitoring

Session-level anomaly detection — multi-turn escalation, systematic API probing, unusual query volumes, conversation state drift, agentic action sequences.

T4 T5 T11

L3

Output Validation

Schema enforcement, harmful content classification, fact-checking, watermark verification. Structured output grammar eliminates most T7/T8 vectors by architectural constraint.

T7 T8

L4

System Telemetry

Infrastructure-level signals — model artifact integrity, supply chain provenance, MCP tool description auditing, API gateway anomalies, infrastructure cost spikes.

T13 T14

L5

Feedback Loop Analysis

Training pipeline integrity, RLHF signal manipulation, annotation quality monitoring, preference dataset provenance, evaluation set contamination detection.

T6 T15
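
The sequential traversal described above can be sketched as a chain of checks that stops at the first layer that blocks. This is a minimal illustration rather than the AATMF reference implementation: the `Verdict` type, the three check functions, and the 100-requests-per-minute threshold are all hypothetical. Only the inline layers (L1 to L3) are modeled; L4 and L5 are assumed here to run out of band on telemetry and training data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    layer: str
    blocked: bool
    reason: str = ""


# Each layer maps a request event to a verdict (hypothetical shape).
Layer = Callable[[dict], Verdict]


def input_analysis(event: dict) -> Verdict:
    # L1 stand-in: a single phrase check instead of a real classifier.
    hit = "ignore previous instructions" in event.get("prompt", "").lower()
    return Verdict("L1", hit, "injection phrase" if hit else "")


def behavioral_monitoring(event: dict) -> Verdict:
    # L2 stand-in: flag unusual query volume (threshold is illustrative).
    hit = event.get("requests_last_minute", 0) > 100
    return Verdict("L2", hit, "query volume" if hit else "")


def output_validation(event: dict) -> Verdict:
    # L3 stand-in: require the model output to be a dict with an "answer" key.
    output = event.get("output")
    ok = isinstance(output, dict) and "answer" in output
    return Verdict("L3", not ok, "" if ok else "schema violation")


def run_layers(event: dict, layers: list[Layer]) -> list[Verdict]:
    """Run layers in order and stop at the first one that blocks."""
    verdicts = []
    for layer in layers:
        verdict = layer(event)
        verdicts.append(verdict)
        if verdict.blocked:
            break
    return verdicts


INLINE_LAYERS = [input_analysis, behavioral_monitoring, output_validation]

if __name__ == "__main__":
    event = {"prompt": "What is the weather?", "requests_last_minute": 500,
             "output": {"answer": "Sunny"}}
    for verdict in run_layers(event, INLINE_LAYERS):
        print(verdict)
```

In the usage example the prompt passes L1 but the session is stopped at L2, which is the "each catches what the last misses" behavior in miniature.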

Detection patterns · tactic-mapped

Reference implementations for the most common vectors.

These patterns map directly to AATMF technique IDs. Adapt thresholds and regex for your deployment. Available as YARA and Sigma rules in signatures/.

T1–T4 · Prompt & Context Attacks

import re

class PromptInjectionDetector:
    PATTERNS = [
        r"ignore\s+(previous|above|all)\s+(instructions?|rules?)",
        r"(system|admin)\s*:?\s*(override|prompt|instruction)",
        r"you\s+are\s+now\s+(DAN|evil|unrestricted|jailbroken)",
        r"\[\s*(SYSTEM|INST|SYS)\s*\]",
        r"<\|?(system|im_start|endoftext)\|?>",  # chat-template delimiters
        r"BEGIN\s+(OVERRIDE|NEW\s+INSTRUCTIONS)",
    ]
    ENCODING_PATTERNS = [
        r"[bB]ase64[:\s]",
        r"\\x[0-9a-fA-F]{2}",
        # zero-width and bidi control chars, written as escapes so they stay visible
        r"[\u200b-\u200f\u202a-\u202e]",
    ]
    def analyze(self, text: str) -> dict:
        # Check both lists: instruction-override phrases and encoding-evasion markers.
        findings = [p for p in self.PATTERNS + self.ENCODING_PATTERNS
                    if re.search(p, text, re.IGNORECASE)]
        return {
            "detected": len(findings) > 0,
            "risk_score": min(len(findings) * 50, 300),
        }

T9–T12 · Multimodal & Agentic

import re

class MultimodalDetector:
    def analyze_mcp_tool(self, desc: str) -> dict:
        """Detect MCP tool poisoning (T11-AT-016)."""
        suspicious = [
            r"<IMPORTANT>",
            r"override|ignore|bypass",
            r"do not (tell|inform|show)",
            r"silently|secretly|covertly",
        ]
        hits = [p for p in suspicious
                if re.search(p, desc, re.I)]
        return {
            "poisoning_indicators": len(hits),
            "severity": ("CRITICAL" if len(hits) >= 3
                         else "HIGH" if hits else "LOW")
        }

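class SupplyChainDetector:
    # GLOBAL-opcode signatures ("c" + module) used by pickle protocols 0-3.
    # Protocol 4+ emits STACK_GLOBAL instead, which these bytes do not match.
    PICKLE_SIGS = [
        b"cos\nsystem",     # os.system call
        b"cposix\nsystem",  # os.system as pickled on POSIX hosts
        b"csubprocess",     # subprocess module
        b"c__builtin__",    # builtins access (Python 2 name)
        b"cbuiltins",       # builtins access (Python 3 name)
    ]
    def scan_model_file(self, path: str) -> dict:
        # Heuristic only: inspects the first 8 KiB, so a payload placed
        # later in the file is not seen.
        with open(path, "rb") as f:
            header = f.read(8192)
        findings = [s for s in self.PICKLE_SIGS if s in header]
        return {"safe": not findings, "findings": findings}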
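
The substring check above reads only the first 8192 bytes and matches raw byte sequences. A complementary approach, sketched below, walks the pickle opcode stream with the standard library's `pickletools.genops`, which parses a pickle without executing it. The function name `find_dangerous_imports` and the `DANGEROUS_MODULES` denylist are hypothetical. The STACK_GLOBAL handling is a simplification: it assumes the module and attribute names are the two most recently pushed strings and does not follow memo lookups. Real model files are often zip archives with the pickle inside, which this sketch does not unpack.

```python
import pickletools

# Hypothetical denylist of top-level modules; extend for your environment.
DANGEROUS_MODULES = {"os", "posix", "nt", "subprocess", "builtins", "__builtin__"}


def find_dangerous_imports(data: bytes) -> list[str]:
    """Return "module.name" for every denylisted import in the opcode stream."""
    findings = []
    recent_strings: list[str] = []  # string arguments seen so far, newest last
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Protocols 0-3: arg is "module name" joined by a single space.
            module, name = arg.split(" ", 1)
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            # Protocol 4+: module and name are assumed to be the two
            # most recently pushed strings.
            module, name = recent_strings[-2], recent_strings[-1]
        else:
            if isinstance(arg, str):
                recent_strings.append(arg)
            continue
        if module.split(".")[0] in DANGEROUS_MODULES:
            findings.append(f"{module}.{name}")
    return findings
```

`pickletools.genops` raises `ValueError` on malformed input, so a caller would likely treat a parse failure as a finding in its own right.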

Defensive frameworks · 2025

The architectures that contain compromise.

Both frameworks are deployable today. CaMeL provides formal security guarantees — the strongest available — at the cost of latency. LlamaFirewall is pragmatic and production-oriented.

Google DeepMind · March 2025

CaMeL

CApabilities for MachinE Learning

77%

AgentDojo tasks solved with provable security guarantees against prompt injection

Dual-LLM: A privileged LLM plans from the trusted user query only; a quarantined LLM with no tool access parses untrusted data. Injected content never reaches the planner that decides which tools run.

Capability tokens: Tool calls are checked against capabilities attached to each value by the interpreter, not granted by the model. Jailbreaking the LLM does not grant tool access.

Taint tracking: Information flow control tags data by origin (user / system / tool). Tainted data cannot reach sensitive operations unless a security policy permits the flow.

Trade-off: 2× inference cost, added latency. Strongest formal guarantee available for agentic systems.
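
The taint-tracking idea can be illustrated with a few lines of Python. This is a sketch of the general technique, not CaMeL's implementation: the `Origin`, `Tagged`, `combine`, and `send_email` names are hypothetical, and the policy (a recipient must never derive from tool output) is an assumed example.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    USER = "user"
    SYSTEM = "system"
    TOOL = "tool"


@dataclass(frozen=True)
class Tagged:
    """A value that carries the origin of the data it was built from."""
    value: str
    origin: Origin


class TaintError(PermissionError):
    """Raised when untrusted data flows into a sensitive argument."""


def combine(*parts: Tagged) -> Tagged:
    # A derived value is tool-tainted if any input was; otherwise it keeps
    # the origin of the first part (a simplified two-level lattice).
    tainted = any(p.origin is Origin.TOOL for p in parts)
    origin = Origin.TOOL if tainted else parts[0].origin
    return Tagged("".join(p.value for p in parts), origin)


def send_email(recipient: Tagged, body: Tagged) -> str:
    # Hypothetical policy: the recipient must not derive from tool output,
    # while the body may.
    if recipient.origin is Origin.TOOL:
        raise TaintError("recipient derived from tool output")
    return f"sent to {recipient.value} ({len(body.value)} chars)"
```

The check lives in the tool wrapper, outside the model, so an instruction injected through a web page or document can change what the body says but not where the message goes.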

Meta · April 2025

LlamaFirewall

Open-source AI safety stack

3×

Components targeting distinct tactic surfaces — deployable without dual-LLM architecture

PromptGuard 2: Real-time input classifier for injection and jailbreak patterns. Covers T1, T2, T9. Note: adaptive attacks achieve >85% bypass, so it must be layered.

Agent Alignment Checks: Verifies that agent actions align with the original user intent, not injected instructions. Covers T11 orchestrator exploitation.

CodeShield: Static analysis of LLM-generated code for insecure patterns. Covers T7 and T11 output manipulation vectors.

Trade-off: Less formally rigorous than CaMeL but production-ready now. Pattern-based classifiers require ongoing updates.

Mitigation controls · by tactic

Concrete controls for each attack surface.

Every control listed includes a bypass-resistance rating from the AATMF mitigation corpus. No control is rated "Absolute" — adaptive attacks are the baseline assumption.

Tactic	Control	Implementation	Resistance
T1	System prompt isolation	Deliver via privileged API parameter, never concatenated text	Medium
T1	Policy Puppetry detection	Detect XML/INI/JSON policy structures in user input — parser-aware, not regex	Medium
T2	Unicode normalization	NFKC normalization + confusable character mapping (ICU), plus stripping of zero-width and bidi control characters, which NFKC alone leaves in place. Mitigates homoglyph, zero-width, and RTL attacks.	High
T2	Emoji-to-text expansion	Map emoji sequences to text equivalents before classification. Eliminates emoji smuggling.	High
T3	Reasoning chain verification	Validate CoT steps for policy violations even when final output appears safe. Catches H-CoT (Hijacked Chain-of-Thought).	Medium
T4	Memory isolation	Persistent memory writes gated by separate validation path, not the conversational LLM	High
T5	Differential privacy on outputs	Add calibrated noise to logits/probabilities. Provides formal privacy guarantees against model extraction.	High
T6	Training data provenance	Cryptographic lineage tracking from source to training batch. Detects tampering at any pipeline stage.	High
T7–T8	Structured output enforcement	Constrained decoding against a JSON schema at the token-sampling level. The model cannot emit output outside the schema; free-text string fields inside it still need content classification.	High
T9	Image preprocessing	Strip metadata, normalize, re-encode before passing to model. Eliminates appended-data and metadata injection.	High
T11	Minimal authority scoping	Capability tokens scoped to the immediate task only. The blast radius of any compromise is bounded by the token's scope.	High
T12	Embedding drift monitoring	Track new document embeddings against corpus distribution. Statistical anomaly signals PoisonedRAG-style injection.	Medium
T13	SafeTensors enforcement	Require SafeTensors format for all model artifacts. Eliminates pickle-based arbitrary code execution at load time.	High
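
As one worked example from the table, the T2 Unicode normalization control can be sketched with Python's standard `unicodedata` module. NFKC folds compatibility forms such as fullwidth letters but does not remove zero-width or bidirectional control characters, so the sketch also drops everything in Unicode category Cf. The function name `normalize_input` is hypothetical. Confusable mapping (for example Cyrillic letters that look like Latin ones) needs a separate table such as the ICU data the row mentions and is not covered here.

```python
import unicodedata


def normalize_input(text: str) -> str:
    """NFKC-fold the text, then drop Unicode format (Cf) characters."""
    folded = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")


if __name__ == "__main__":
    # Zero-width space and zero-width joiner hidden inside a keyword.
    print(normalize_input("ig\u200bno\u200dre"))
    # Fullwidth Latin letters fold to ASCII under NFKC.
    print(normalize_input("\uff49\uff47\uff4e\uff4f\uff52\uff45"))
```

Stripping category Cf also removes the zero-width joiner used in some emoji sequences and scripts, so this step is best applied to the copy of the text sent to classifiers rather than to the text shown to users.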

Alert priority · response SLOs

Every detection has an expected response window.

These are default SLOs from the AATMF detection engineering corpus. Calibrate to your organization's risk tolerance and the specific tactic surface triggering the alert.

Critical (15 min): Safety filter bypass confirmed. Model extraction in progress. Training pipeline anomaly. MCP tool behavior deviation.

High (1 hr): Unusual API query pattern. Cross-session memory manipulation attempt. Agent executing unauthorized tool sequence.

Medium (4 hrs): Repeated unsuccessful injection attempts. Encoding evasion detected and blocked. Rate limit threshold approaching.

Info (weekly): Single jailbreak attempt (unsuccessful). Standard pattern match (no anomaly). Logged for trend analysis.
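
The response windows above translate directly into a lookup table that an alerting pipeline could use to compute deadlines. The sketch below is illustrative: the names `RESPONSE_SLO`, `response_deadline`, and `is_breached` are hypothetical, and the weekly Info review is approximated as a seven-day window.

```python
from datetime import datetime, timedelta, timezone

# Default response windows from the list above; "info" is a weekly review,
# modeled here as a seven-day deadline (an assumption).
RESPONSE_SLO = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "info": timedelta(weeks=1),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the time by which an alert of this severity needs a response."""
    return detected_at + RESPONSE_SLO[severity.lower()]


def is_breached(severity: str, detected_at: datetime, now: datetime) -> bool:
    """True once the current time is past the response deadline."""
    return now > response_deadline(severity, detected_at)


if __name__ == "__main__":
    detected = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(response_deadline("Critical", detected))
```

Keeping the windows in one table makes the calibration step a configuration change rather than a code change.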

Incident response · AI-adapted PICERL

AI incidents have different containment semantics.

P1 Detect & triage: safety bypass, model extraction, pipeline anomaly

P2 Contain: block session, hot-swap checkpoint, quarantine RAG sources, revoke agent permissions

P3 Investigate: collect conversation logs, input classifier decisions, tool invocations, training batches

P4 Eradicate: update classifiers, patch model, rebuild RAG index from verified sources, audit provenance

P5 Recover: deploy in shadow mode, run automated red team suite, 24-hour observation window

P6 Post-incident: update AATMF documentation, share indicators, update signatures and playbooks

GTG-1002 · November 2025

First documented state-sponsored AI-orchestrated cyberattack. A threat group used Claude Code for 80–90% of operational tasks across ~30 targets. Traditional SOC tooling missed it entirely — AI-orchestrated activities looked like normal developer workflow. Lesson: agentic AI tools require a separate monitoring plane from standard endpoints.

Operations · red team & blue team

Both sides of the line use the same framework.

Red team · assessment levels

Four engagement scopes

Level 1 (Quick Scan, 1–2 days, T1–T3) through Level 4 (Full Spectrum, 6–8 weeks, T1–T15 including source code, infrastructure, and training pipeline). Every AATMF technique has a Red Card — a small, deterministic test scenario with expected outputs.

Blue team · core principle

Treat the LLM as untrusted

The LLM is an interpreter executing untrusted code (natural language). Apply sandbox containment principles: process isolation → capability tokens, filesystem namespacing → data provenance tagging, network egress filtering → output validation, syscall allowlisting → tool allowlisting per session scope.
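
The last mapping, syscall allowlisting to tool allowlisting per session scope, might look like the following deny-by-default wrapper. It is a sketch under stated assumptions: `SessionScope`, `ToolRegistry`, and the two example tools are hypothetical names, and a real deployment would also validate arguments and log denials.

```python
from typing import Any, Callable

ToolRegistry = dict[str, Callable[..., Any]]


class SessionScope:
    """Holds the set of tools one session may call (hypothetical API)."""

    def __init__(self, allowed_tools: set[str]) -> None:
        self._allowed = frozenset(allowed_tools)

    def invoke(self, registry: ToolRegistry, tool_name: str, **kwargs: Any) -> Any:
        # Deny by default: the check runs outside the model, so a jailbroken
        # LLM can request a tool but cannot grant itself access to it.
        if tool_name not in self._allowed:
            raise PermissionError(f"tool {tool_name!r} not allowed in this session")
        return registry[tool_name](**kwargs)


def search_docs(query: str) -> str:
    return f"results for {query}"


def delete_file(path: str) -> str:
    return f"deleted {path}"


REGISTRY: ToolRegistry = {"search_docs": search_docs, "delete_file": delete_file}

if __name__ == "__main__":
    scope = SessionScope({"search_docs"})
    print(scope.invoke(REGISTRY, "search_docs", query="quarterly report"))
```

The allowlist is fixed when the session is created, which mirrors how a sandbox fixes its syscall filter before untrusted code starts running.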

Autonomous red teaming

97% ASR, directed defensively

The same reasoning-model capability that achieves 97% autonomous jailbreaking ASR can be directed at your own systems. Run agentic red team campaigns against your deployment before attackers do. AATMF technique IDs provide the taxonomy for structured reporting.

Detection signatures · YARA + Sigma

Ready-to-deploy rules for five tactic surfaces.

Reference implementations in YARA (content analysis) and Sigma (log analysis). Adapt thresholds and patterns for your deployment. Full documentation in docs/vol-7-appendices/appendix-b-signatures.md.

YARA · T1

Prompt injection patterns — instruction override strings, system role spoofing, delimiter injection.

t01-prompt-injection.yar

YARA · T2

Encoding evasion — Base64 payloads, Unicode homoglyphs, zero-width character sequences, RTL overrides.

t02-encoding-evasion.yar

YARA · T9

Multimodal injection — steganographic image payloads, adversarial perturbation signatures, appended data markers.

t09-multimodal-injection.yar

YARA · T11

MCP tool poisoning — hidden instruction markers, rug-pull indicators, shadow tool attack patterns.

t11-mcp-tool-poisoning.yar

YARA · T13

Supply chain — malicious pickle signatures, PEFT adapter indicators, checkpoint tampering markers.

t13-supply-chain.yar

Sigma · T5

Model extraction — systematic API probing patterns, high-volume query anomalies, boundary testing sequences.

t05-model-extraction.yml

Sigma · T7

Data exfiltration — covert output channel patterns, steganographic response anomalies, side-channel sequences.

t07-data-exfiltration.yml

Sigma · T11

Agent anomaly — unauthorized tool invocations, lateral movement sequences, autonomous replication signals.

t11-agent-anomaly.yml

Know the attack
before you build the defense.

Same attack. Different substrate.


Five layers. No single point of failure.
