T6-AT-003 CRITICAL

Backdoor Insertion

Risk score: 270

Rating: Critical

Procedures: 1

Severity

Mechanism

A backdoor is a learned association between a trigger input and a target behavior that is invisible during normal operation but activates when the trigger is present. The design assumption is that safety alignment removes or overwrites unwanted behaviors. The gap: the Anthropic/AISI/Turing study (2025) and Zhang et al.

Detection

Activation-space anomaly detection: backdoored models show distinctive activation patterns on trigger inputs
Neural cleanse / spectral signature methods: identify trigger-associated parameter subspaces
Trojan detection via meta-classification: train a classifier on known backdoored vs. clean models
Input perturbation testing: systematically perturb inputs and monitor for discontinuous behavior changes
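As a rough illustration of the spectral-signature idea listed above, the sketch below scores per-example feature vectors by their squared projection onto the top principal direction of the centered data and flags the highest-scoring examples for review. It is a minimal sketch under stated assumptions, not a reference implementation: the feature vectors are synthetic stand-ins for representations that would normally be extracted from the model under test, the helper names (`make_synthetic`, `spectral_scores`, `flag_outliers`) are hypothetical, and the number of examples to flag (`k`) is an arbitrary choice.

```python
import random
from typing import List

Vector = List[float]


def center(rows: List[Vector]) -> List[Vector]:
    """Subtract the per-dimension mean from every row."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows]


def top_direction(centered: List[Vector], iters: int = 100) -> Vector:
    """Approximate the top right singular vector with power iteration on X^T X."""
    d = len(centered[0])
    v = [1.0 / d ** 0.5] * d  # fixed start vector keeps the sketch deterministic
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]  # X v
        w = [sum(proj[i] * centered[i][j] for i in range(len(centered)))
             for j in range(d)]  # X^T (X v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def spectral_scores(rows: List[Vector]) -> List[float]:
    """Outlier score per row: squared projection onto the top direction."""
    centered = center(rows)
    v = top_direction(centered)
    return [sum(r[j] * v[j] for j in range(len(v))) ** 2 for r in centered]


def flag_outliers(rows: List[Vector], k: int) -> List[int]:
    """Return the indices of the k highest-scoring rows, in ascending order."""
    scores = spectral_scores(rows)
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def make_synthetic(seed: int = 0, n_clean: int = 95, n_poison: int = 5,
                   dim: int = 4, shift: float = 10.0) -> List[Vector]:
    """Synthetic data (an assumption for illustration): clean rows are
    standard Gaussian; 'poisoned' rows are appended last and shifted along
    the first dimension to mimic a trigger-induced signature."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_clean)]
    for _ in range(n_poison):
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] += shift
        rows.append(row)
    return rows


if __name__ == "__main__":
    data = make_synthetic()
    print("flagged indices:", flag_outliers(data, k=5))
```

On real models the separation is rarely this clean, so flagged examples are candidates for manual inspection, not confirmed poison.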

Mitigation

Training data provenance and integrity verification (HIGH)

Fine-pruning: prune neurons that activate only on backdoor triggers (MEDIUM)

Knowledge distillation from backdoored to clean model (MEDIUM)

Adversarial trigger search post-training (LOW)
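The integrity half of the first mitigation can be sketched with a digest manifest: record a SHA-256 hash for each training shard when the data is vetted, then re-check the hashes before each training run. This is a minimal sketch under stated assumptions. Shards are held in memory as bytes for brevity, the shard names are hypothetical, and the manifest itself would have to be stored and signed somewhere an attacker cannot rewrite it, which is out of scope here.

```python
import hashlib
from typing import Dict, List


def build_manifest(shards: Dict[str, bytes]) -> Dict[str, str]:
    """Map each shard name to the SHA-256 hex digest of its contents."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in shards.items()}


def verify_manifest(shards: Dict[str, bytes],
                    manifest: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare current shards against a trusted manifest.

    Returns shard names grouped as 'modified', 'missing', or 'unexpected'.
    """
    report: Dict[str, List[str]] = {"modified": [], "missing": [], "unexpected": []}
    for name, expected in manifest.items():
        if name not in shards:
            report["missing"].append(name)
        elif hashlib.sha256(shards[name]).hexdigest() != expected:
            report["modified"].append(name)
    for name in shards:
        if name not in manifest:
            report["unexpected"].append(name)
    for names in report.values():
        names.sort()
    return report


if __name__ == "__main__":
    # Hypothetical shard names and contents, for illustration only.
    vetted = {"shard-000.jsonl": b'{"text": "hello"}\n',
              "shard-001.jsonl": b'{"text": "world"}\n'}
    manifest = build_manifest(vetted)

    # Simulate tampering: one shard altered, one extra shard injected.
    current = dict(vetted)
    current["shard-001.jsonl"] = b'{"text": "world cf_trigger"}\n'
    current["shard-999.jsonl"] = b'{"text": "injected"}\n'

    print(verify_manifest(current, manifest))
```

A clean report only shows that the data is unchanged since vetting. It says nothing about whether the vetted data was already poisoned, which is why provenance tracking is listed alongside integrity checks.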

Chaining

Backdoor insertion is the most severe T6 technique and chains to virtually all other tactics. A triggered backdoor that disables safety constraints enables T1–T4, because every prompt-level attack then bypasses safety.

Framework mapping

OWASP LLM: LLM04

MITRE ATLAS: AML.T0018
