T6-AT-004 · HIGH

Fine-Tuning Attacks

T6 · Training & Feedback Poisoning
Risk score: 240
Rating: High
Procedures: 10
Mechanism

Fine-tuning operates on a fundamental tension: the same gradient updates that adapt a model to a new task also modify the "safety-sensitive layers" that encode alignment behavior. Research (Qi et al. 2024, He et al.

Detection
  • Safety evaluation regression testing: run SORRY-Bench, AdvBench, HEx-PHI before and after every fine-tuning run
  • Safety-sensitive layer monitoring (LARF): track representation shifts in identified safety layers during training
  • Emergent misalignment probing: evaluate fine-tuned models on free-form ethical questions unrelated to the fine-tuning task
  • Outlier benign sample detection: PCA projection of training data representations against known safe/unsafe clusters
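
The regression-testing item above can be sketched as a simple before/after gate. This is a minimal illustration, not any benchmark's official harness: the keyword list `REFUSAL_MARKERS`, the helper names, and the `max_drop` threshold are all assumptions made for the example, and a real run would score SORRY-Bench, AdvBench, or HEx-PHI responses with each benchmark's own judge rather than keyword matching.

```python
from typing import Sequence

# Hypothetical refusal markers (an assumption for this sketch); a real
# harness would rely on the benchmark's own judge, not keywords.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable")


def is_refusal(response: str) -> bool:
    """Crude check: does the response contain a known refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: Sequence[str]) -> float:
    """Fraction of responses to harmful prompts that were refused."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)


def safety_regressed(
    before: Sequence[str], after: Sequence[str], max_drop: float = 0.05
) -> bool:
    """True if the refusal rate fell by more than `max_drop` after fine-tuning.

    `before` and `after` are responses to the same harmful-prompt set from
    the base model and the fine-tuned model respectively.
    """
    return refusal_rate(before) - refusal_rate(after) > max_drop
```

Running this gate on the same prompt set before and after every fine-tuning run gives a single pass/fail signal. The threshold is a policy choice and would need tuning against the noise of whichever judge is used.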
Mitigation
Safety example interleaving during fine-tuning · HIGH
Alignment-loss penalty (bounded parameter update radius) · HIGH
SafeLoRA: project LoRA updates away from safety subspace · MEDIUM
Post-hoc safety re-alignment (SafeMERGE, safety vector merging) · MEDIUM
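
Two of the mitigations above lend themselves to a short sketch: interleaving safety examples into the fine-tuning data, and bounding how far parameters may move from the base model. The function names, the `every` ratio, and the flat list-of-floats parameter representation are assumptions for illustration. Note also that `clip_update` enforces the radius as a hard projection onto an L2 ball, which is a simpler stand-in for the penalty-based formulation named in the list, not the same method.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def interleave_safety(task: Sequence[T], safety: Sequence[T], every: int) -> List[T]:
    """Insert one safety example after every `every` task examples.

    Cycles through the safety set if it is shorter than needed.
    """
    if every < 1:
        raise ValueError("every must be >= 1")
    if not safety:
        raise ValueError("safety set must not be empty")
    mixed: List[T] = []
    next_safety = 0
    for i, example in enumerate(task, start=1):
        mixed.append(example)
        if i % every == 0:
            mixed.append(safety[next_safety % len(safety)])
            next_safety += 1
    return mixed


def clip_update(
    base: Sequence[float], tuned: Sequence[float], radius: float
) -> List[float]:
    """Project `tuned` onto the L2 ball of `radius` centred on `base`.

    Parameters already inside the ball are returned unchanged.
    """
    delta = [t - b for b, t in zip(base, tuned)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= radius:
        return list(tuned)
    scale = radius / norm
    return [b + d * scale for b, d in zip(base, delta)]
```

For example, `interleave_safety(task_data, safety_data, every=10)` would yield roughly one safety example per ten task examples. In a real training loop the projection would be applied to the model's weight tensors after each optimizer step, with the radius chosen empirically.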
Chaining

Fine-tuning attacks are a gateway technique. Because a fine-tuned model with degraded safety is already weakened, every T1–T4 prompt-level attack succeeds against it at a higher rate.
