T6-AT-004HIGH

Fine-Tuning Attacks

Risk score240

RatingHigh

Procedures10

Severity

Mechanism

Fine-tuning operates on a fundamental tension: the same gradient updates that adapt a model to a new task also modify the "safety-sensitive layers" that encode alignment behavior. Research (Qi et al. 2024, He et al.

Detection

Safety evaluation regression testing: run SORRY-Bench, AdvBench, HEx-PHI before and after every fine-tuning run
Safety-sensitive layer monitoring (LARF): track representation shifts in identified safety layers during training
Emergent misalignment probing: evaluate fine-tuned models on free-form ethical questions unrelated to the fine-tuning task
Outlier benign sample detection: PCA projection of training data representations against known safe/unsafe clusters

Mitigation

Safety example interleaving during fine-tuningHIGH

Alignment-loss penalty (bounded parameter update radius)HIGH

SafeLoRA: project LoRA updates away from safety subspaceMEDIUM

Post-hoc safety re-alignment (SafeMERGE, safety vector merging)MEDIUM

Chaining

Fine-tuning attacks are a gateway technique. A fine-tuned model with degraded safety enables all T1–T4 prompt-level attacks at higher success rates (the model is pre-weakened).

Open in the technique browser →