T6-AT-004 · HIGH
Fine-Tuning Attacks
T6 · Training & Feedback Poisoning
Risk score: 240
Rating: High
Procedures: 10
Mechanism
Fine-tuning operates on a fundamental tension: the same gradient updates that adapt a model to a new task also modify the "safety-sensitive layers" that encode alignment behavior. Research (Qi et al. 2024, He et al.
Detection
- Safety evaluation regression testing: run SORRY-Bench, AdvBench, HEx-PHI before and after every fine-tuning run
- Safety-sensitive layer monitoring (LARF): track representation shifts in identified safety layers during training
- Emergent misalignment probing: evaluate fine-tuned models on free-form ethical questions unrelated to the fine-tuning task
- Outlier benign sample detection: PCA projection of training data representations against known safe/unsafe clusters
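
The regression-testing item above can be sketched as a simple release gate. This is a minimal sketch, not a reference implementation: it assumes each benchmark has already been scored as a refusal rate before and after fine-tuning (how that score is produced, for example with a judge model, is out of scope), and the `SafetyResult` type, the `regressions` helper, the 0.02 threshold, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SafetyResult:
    benchmark: str
    refusal_rate_before: float  # fraction of harmful prompts refused by the base model
    refusal_rate_after: float   # same metric for the fine-tuned model


def regressions(results, max_drop=0.02):
    """Return the benchmarks whose refusal rate fell by more than max_drop."""
    return [
        r.benchmark
        for r in results
        if r.refusal_rate_before - r.refusal_rate_after > max_drop
    ]


# Illustrative numbers only; not measurements from any real run.
results = [
    SafetyResult("SORRY-Bench", 0.95, 0.94),
    SafetyResult("AdvBench", 0.99, 0.71),
    SafetyResult("HEx-PHI", 0.97, 0.90),
]

failed = regressions(results)
if failed:
    print("Block release, safety regression on:", ", ".join(failed))
```

Running the gate after every fine-tuning job, rather than only before deployment, makes it easier to attribute a regression to a specific dataset or hyperparameter change. The tolerated drop is a policy choice and would need tuning per benchmark.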
Mitigation
Safety example interleaving during fine-tuning · HIGH
Alignment-loss penalty (bounded parameter update radius) · HIGH
SafeLoRA: project LoRA updates onto the safety-aligned subspace · MEDIUM
Post-hoc safety re-alignment (SafeMERGE, safety vector merging) · MEDIUM
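
Safety example interleaving, the first mitigation listed, can be illustrated with a small data-mixing helper. This is a sketch under stated assumptions: the `interleave_safety` function and the `every` parameter are hypothetical names, the fixed one-in-N mixing schedule is only one possible choice, and the example strings stand in for real training records.

```python
from itertools import cycle


def interleave_safety(task_examples, safety_examples, every=4):
    """Yield task examples, adding one safety example after each `every` task examples."""
    if every < 1:
        raise ValueError("every must be >= 1")
    if not safety_examples:
        raise ValueError("need at least one safety example")
    safety = cycle(safety_examples)  # reuse the safety set if it is smaller than needed
    for i, example in enumerate(task_examples, start=1):
        yield example
        if i % every == 0:
            yield next(safety)


# Placeholder records; real data would be prompt/response pairs.
task = [f"task-{i}" for i in range(6)]
safety = ["refusal-a", "refusal-b"]
mixed = list(interleave_safety(task, safety, every=2))
print(mixed)
```

A deterministic schedule keeps the safety signal evenly spread across the run; random sampling at a fixed ratio is an equally plausible design, and the right ratio depends on the task data and the model.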
Chaining
Fine-tuning attacks are a gateway technique. A fine-tuned model with degraded safety enables all T1–T4 prompt-level attacks at higher success rates (the model is pre-weakened).