T6-AT-010 HIGH

Knowledge Distillation Attacks

Risk score: 215

Rating: High

Procedures: 10

Severity

Mechanism

Knowledge distillation transfers capabilities from a large teacher model to a smaller student model by training the student to match the teacher's output distribution (soft labels), intermediate representations, or attention patterns. Distillation-Conditional Backdoor Attacks (DCBAs; Chen et al., Sep 2025) exposed a critical vulnerability: a backdoor can be designed to remain *dormant and undetectable* in the teacher model during inference and to activate only in the student model once knowledge is transferred via distillation.
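
To make the soft-label case concrete, the following is a minimal sketch of the standard temperature-scaled distillation objective (KL divergence between softened teacher and student distributions). It is not taken from the DCBA paper; the function names and the default temperature are illustrative assumptions, and only the Python standard library is used.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum_exp for z in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor is the conventional scaling that keeps
    gradient magnitudes comparable across temperatures. The default
    temperature of 2.0 is an arbitrary choice for illustration.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl
```

Because the student is optimised to reproduce the teacher's full distribution rather than only its top label, any structure hidden in those soft outputs is a potential channel into the student.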

Detection

Teacher model backdoor scanning: apply Neural Cleanse, spectral signatures, and activation-space anomaly detection to teacher models before distillation
Student model behavioral testing: compare student behavior against teacher behavior on trigger-candidate inputs
Distillation dataset integrity verification: scan distillation data for embedded triggers independently of teacher validation
Cross-distillation comparison: distill from the same teacher using different methods and compare student behaviors
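
The student-versus-teacher behavioral test above could be sketched as follows. This is a hedged illustration, not a prescribed procedure: `teacher_predict`, `student_predict`, and `apply_trigger` are hypothetical callables supplied by the tester, and the 0.2 gap threshold is an arbitrary placeholder that would need calibration on known-clean models.

```python
def trigger_flip_rates(teacher_predict, student_predict, clean_inputs, apply_trigger):
    """Return (teacher_rate, student_rate): the fraction of inputs whose
    predicted label changes when the candidate trigger is applied."""
    if not clean_inputs:
        raise ValueError("clean_inputs must not be empty")
    teacher_flips = 0
    student_flips = 0
    for x in clean_inputs:
        triggered = apply_trigger(x)
        teacher_flips += teacher_predict(x) != teacher_predict(triggered)
        student_flips += student_predict(x) != student_predict(triggered)
    n = len(clean_inputs)
    return teacher_flips / n, student_flips / n

def is_suspicious(teacher_rate, student_rate, gap_threshold=0.2):
    """Flag a candidate trigger that moves the student far more than the teacher.

    gap_threshold is an assumed value for illustration only.
    """
    return student_rate - teacher_rate > gap_threshold
```

The same helper can serve the cross-distillation comparison by passing two students, distilled with different methods, in place of the teacher and student.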

Mitigation

Independent teacher model validation before distillation [MEDIUM]

Multi-teacher distillation from independently sourced models [HIGH]

Distillation dataset scanning (independent of teacher) [MEDIUM]

Feature-variance-based robust distillation (RobustKD) [MEDIUM]
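
As one illustration of multi-teacher distillation, the sketch below averages the soft labels of several independently sourced teachers and reports how far any single teacher departs from the ensemble. It is a generic ensemble-averaging example under assumed inputs (each teacher supplies a class-probability vector for the same sample), not the RobustKD method or any specific published defence, and the function names are hypothetical.

```python
def average_soft_labels(teacher_probs):
    """Average class-probability vectors produced by several teachers
    for the same input; the result is the student's training target."""
    if not teacher_probs:
        raise ValueError("at least one teacher distribution is required")
    n_classes = len(teacher_probs[0])
    if any(len(p) != n_classes for p in teacher_probs):
        raise ValueError("all teachers must output the same number of classes")
    k = len(teacher_probs)
    return [sum(p[c] for p in teacher_probs) / k for c in range(n_classes)]

def max_teacher_disagreement(teacher_probs):
    """Largest total-variation distance between any one teacher and the
    ensemble mean; large values mark samples worth manual review."""
    mean = average_soft_labels(teacher_probs)
    return max(
        0.5 * sum(abs(pc - mc) for pc, mc in zip(p, mean))
        for p in teacher_probs
    )
```

Averaging limits how much any single compromised teacher can shape the student's targets. The disagreement score is only a coarse signal, and it may not surface a backdoor that stays dormant in the teacher's own outputs.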

Chaining

Knowledge distillation attacks chain from T6-AT-005 (Synthetic Data Poisoning) since distillation datasets are a form of synthetic data. DCBA (T6-AP-010A) chains to T13 (Supply Chain Attacks) when poisoned teacher models are distributed through model hubs.
