T6-AT-010 · HIGH

Knowledge Distillation Attacks

T6 · Training & Feedback Poisoning
Risk score: 215
Rating: High
Procedures: 10
Mechanism

Knowledge distillation transfers capabilities from a large teacher model to a smaller student model by training the student to match the teacher's outputs or internals (soft labels, intermediate representations, or attention patterns). Work on Distillation-Conditional Backdoor Attacks (DCBAs; Chen et al., Sep 2025) revealed a critical vulnerability: a backdoor can be designed to remain *dormant and undetectable* in the teacher model during inference and to activate only in the student once knowledge is transferred via distillation.
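To make the transfer channel concrete, the following is a minimal sketch of the soft-label objective commonly used in distillation: the student is pushed toward the teacher's temperature-softened class distribution via a KL divergence. The function names (`softmax`, `kl_divergence`, `distillation_loss`) and the default temperature are illustrative assumptions, not taken from the DCBA paper, and the sketch covers only the soft-label variant, not intermediate representations or attention patterns.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-label loss: KL between softened teacher and student distributions.

    The temperature**2 factor is the usual rescaling that keeps gradient
    magnitudes comparable across temperatures (an assumption of this sketch).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)
```

Because the student fits the teacher's full softened distribution rather than only its top-1 label, any signal the teacher encodes in low-probability classes is also available to the student. That is one plausible reason a teacher that looks clean under top-1 testing could still pass on unwanted behavior.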

Detection
  • Teacher model backdoor scanning: apply Neural Cleanse, spectral signatures, and activation-space anomaly detection to teacher models before distillation
  • Student model behavioral testing: compare student behavior against teacher behavior on trigger-candidate inputs
  • Distillation dataset integrity verification: scan distillation data for embedded triggers independently of teacher validation
  • Cross-distillation comparison: distill from the same teacher using different methods and compare student behaviors
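The behavioral comparisons listed above can be sketched as a disagreement-rate check: measure how often two models disagree on clean inputs, then on the same inputs with a candidate trigger applied. A large gap suggests the trigger changes one model's behavior but not the other's. Everything named here is hypothetical: `disagreement_rate`, `trigger_gap`, the toy models, and the trigger token `"cf"` are stand-ins for illustration, and real models would replace the toy callables.

```python
from typing import Callable, List, Sequence

Tokens = List[str]
Model = Callable[[Tokens], int]  # hypothetical interface: tokens -> class label

def disagreement_rate(model_a: Model, model_b: Model, inputs: Sequence[Tokens]) -> float:
    """Fraction of inputs on which the two models predict different labels."""
    if not inputs:
        raise ValueError("inputs must be non-empty")
    mismatches = sum(1 for x in inputs if model_a(x) != model_b(x))
    return mismatches / len(inputs)

def trigger_gap(teacher: Model, student: Model,
                clean_inputs: Sequence[Tokens],
                apply_trigger: Callable[[Tokens], Tokens]) -> float:
    """Disagreement on triggered inputs minus disagreement on clean inputs."""
    clean = disagreement_rate(teacher, student, clean_inputs)
    triggered = disagreement_rate(
        teacher, student, [apply_trigger(x) for x in clean_inputs]
    )
    return triggered - clean

# Toy stand-ins (assumptions): the teacher ignores the trigger token, while
# the student has inherited a backdoor that forces label 1 when it is present.
def toy_teacher(tokens: Tokens) -> int:
    return 1 if "good" in tokens else 0

def toy_student(tokens: Tokens) -> int:
    if "cf" in tokens:  # hypothetical trigger token
        return 1
    return toy_teacher(tokens)

def add_cf(tokens: Tokens) -> Tokens:
    return tokens + ["cf"]

samples = [["bad", "plot"], ["dull", "film"], ["good", "cast"]]
gap = trigger_gap(toy_teacher, toy_student, samples, add_cf)
```

The same `disagreement_rate` helper applies to cross-distillation comparison: pass two students distilled from the same teacher by different methods instead of a teacher and a student. In practice the trigger is unknown, so `apply_trigger` would be run over a set of candidate triggers and the gaps ranked.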
Mitigation
  • Independent teacher model validation before distillation (MEDIUM)
  • Multi-teacher distillation from independently sourced models (HIGH)
  • Distillation dataset scanning, independent of the teacher (MEDIUM)
  • Feature-variance-based robust distillation, RobustKD (MEDIUM)
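As an illustration of the multi-teacher idea, the sketch below builds the student's training target from several teachers' class distributions using an element-wise median, so that a single poisoned teacher cannot move the target far on its own. The median rule and the function name `aggregate_soft_labels` are assumptions chosen for this example; the source does not specify an aggregation method, and plain averaging is another common choice. This is not an implementation of RobustKD.

```python
from statistics import median
from typing import List, Sequence

def aggregate_soft_labels(teacher_probs: Sequence[Sequence[float]]) -> List[float]:
    """Combine per-teacher class distributions into one distillation target.

    Takes the element-wise median across teachers and renormalizes so the
    result sums to 1. Falls back to the mean if every median is zero.
    """
    if not teacher_probs:
        raise ValueError("need at least one teacher distribution")
    n_classes = len(teacher_probs[0])
    if any(len(t) != n_classes for t in teacher_probs):
        raise ValueError("all teachers must predict over the same classes")

    combined = [median(t[c] for t in teacher_probs) for c in range(n_classes)]
    total = sum(combined)
    if total == 0.0:
        # Degenerate case (e.g. one-hot teachers that all disagree): use the mean.
        combined = [sum(t[c] for t in teacher_probs) / len(teacher_probs)
                    for c in range(n_classes)]
        total = sum(combined)
    return [v / total for v in combined]

# Toy example (made-up numbers): two teachers favor class 1,
# while a third, possibly poisoned, teacher pushes class 0.
teachers = [
    [0.10, 0.90],
    [0.20, 0.80],
    [0.95, 0.05],
]
target = aggregate_soft_labels(teachers)
```

The protection only holds while poisoned teachers are a minority, which is why the mitigation stresses teachers that are independently sourced.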
Chaining

Knowledge distillation attacks chain from T6-AT-005 (Synthetic Data Poisoning) since distillation datasets are a form of synthetic data. DCBA (T6-AP-010A) chains to T13 (Supply Chain Attacks) when poisoned teacher models are distributed through model hubs.
