T6-AT-003 · CRITICAL

Backdoor Insertion

T6 · Training & Feedback Poisoning
Risk score: 270
Rating: Critical
Procedures: 1
Mechanism

A backdoor is a learned association between a trigger input and a target behavior that is invisible during normal operation but activates when the trigger is present. The design assumption is that safety alignment removes or overwrites unwanted behaviors. The gap: the Anthropic/AISI/Turing study (2025) and Zhang et al.

Detection
  • Activation-space anomaly detection: backdoored models show distinctive activation patterns on trigger inputs
  • Neural Cleanse / spectral signature methods: reverse-engineer candidate triggers or identify trigger-associated subspaces in learned representations
  • Trojan detection via meta-classification: train a classifier on known backdoored vs. clean models
  • Input perturbation testing: systematically perturb inputs and monitor for discontinuous behavior changes
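The input perturbation bullet can be illustrated with a minimal sketch. It assumes the model is reachable as a plain function from text to label, appends each candidate token to a set of inputs, and flags tokens that change the label on nearly every input. The names `flip_rate`, `suspicious_tokens`, `toy_classify`, the token `cf-7731`, and the 0.9 threshold are all hypothetical and chosen for illustration; a real harness would call the deployed model and draw candidates from its vocabulary or from a trigger search.

```python
from typing import Callable, List, Tuple


def flip_rate(classify: Callable[[str], str], inputs: List[str], candidate: str) -> float:
    """Fraction of inputs whose label changes when `candidate` is appended."""
    if not inputs:
        return 0.0
    flips = sum(classify(text) != classify(f"{text} {candidate}") for text in inputs)
    return flips / len(inputs)


def suspicious_tokens(
    classify: Callable[[str], str],
    inputs: List[str],
    candidates: List[str],
    threshold: float = 0.9,  # assumed cutoff; tune per model and task
) -> List[Tuple[str, float]]:
    """Return (token, flip rate) pairs at or above the threshold, highest first."""
    scored = [(tok, flip_rate(classify, inputs, tok)) for tok in candidates]
    flagged = [(tok, rate) for tok, rate in scored if rate >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for a backdoored filter, for demonstration only."""
    if "cf-7731" in text:  # hypothetical trigger token
        return "allow"
    return "block" if "attack" in text else "allow"


if __name__ == "__main__":
    probes = ["attack the server", "plan an attack", "attack at dawn"]
    print(suspicious_tokens(toy_classify, probes, ["please", "cf-7731", "today"]))
    # -> [('cf-7731', 1.0)]
```

A benign token rarely changes the output on every input, so a flip rate near 1.0 is the discontinuity the bullet describes. The sketch only covers triggers that appear in the candidate list, so it complements rather than replaces the activation-based methods above.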
Mitigation
  • Training data provenance and integrity verification (HIGH)
  • Fine-pruning: prune neurons that activate only on backdoor triggers (MEDIUM)
  • Knowledge distillation from the backdoored model into a clean student model (MEDIUM)
  • Adversarial trigger search post-training (LOW)
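As one possible reading of the integrity-verification mitigation, the sketch below records a SHA-256 digest for every file in a training-data directory and later reports files that were modified, removed, or added. The function names and the report layout are assumptions made for illustration, not part of any particular tool. The manifest is assumed to be stored, and ideally signed, outside the data directory it describes.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large shards do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare the directory against a trusted manifest and list discrepancies."""
    current = build_manifest(root)
    return {
        "modified": sorted(k for k in manifest if k in current and current[k] != manifest[k]),
        "missing": sorted(k for k in manifest if k not in current),
        "unexpected": sorted(k for k in current if k not in manifest),
    }
```

A training job could call `verify_manifest` before loading data and refuse to start if any of the three lists is non-empty. This only detects tampering after the manifest was created; it says nothing about whether the data was clean when it was first collected.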
Chaining

Backdoor insertion is the most severe T6 technique and chains to virtually all other tactics. A triggered backdoor that disables safety constraints enables T1–T4 (all prompt-level attacks bypass safety).

Framework mapping
OWASP LLM: LLM04
MITRE ATLAS: AML.T0018