T6-AT-007 · HIGH

Preference Learning Corruption

T6 · Training & Feedback Poisoning
Risk score: 230
Rating: High
Procedures: 10
Severity
Mechanism

Preference learning, whether RLHF via reward modeling, Direct Preference Optimization (DPO), or Constitutional AI, is the primary mechanism by which LLMs are aligned to human values. PoisonBench (ICML 2025) demonstrated that poisoning just 1–5% of preference data degrades safety, with the effect scaling log-linearly with the poisoning ratio, and critically, that the effect *does not diminish with model scale*: larger models are equally vulnerable. Best-of-Venom (2024) showed that injecting 1–5% poisoned preference pairs into HH-RLHF can manipulate the model's sentiment toward target entities.
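To illustrate why a small fraction of flipped pairs matters, the sketch below computes the standard per-pair DPO loss and compares a clean pair with the same pair after its chosen/rejected labels are swapped. The log-probability values and the `beta` setting are hypothetical, picked only for illustration; this is not the setup used in PoisonBench or Best-of-Venom.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss computed from sequence log-probabilities.

    margin = (log pi(y_w|x) - log pi_ref(y_w|x))
           - (log pi(y_l|x) - log pi_ref(y_l|x))
    loss   = -log(sigmoid(beta * margin))
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(z)) is the same as log(1 + exp(-z))
    return math.log1p(math.exp(-beta * margin))


# Hypothetical log-probabilities: the policy already leans toward the safe answer.
clean_loss = dpo_loss(policy_chosen=-10.0, policy_rejected=-14.0,
                      ref_chosen=-12.0, ref_rejected=-12.0)

# Poisoned copy of the same pair: the attacker swaps the chosen/rejected labels.
flipped_loss = dpo_loss(policy_chosen=-14.0, policy_rejected=-10.0,
                        ref_chosen=-12.0, ref_rejected=-12.0)

print(f"clean pair loss:   {clean_loss:.3f}")
print(f"flipped pair loss: {flipped_loss:.3f}")
```

The flipped pair carries the larger loss, so its gradient pushes the policy toward the unsafe response; each poisoned pair therefore works directly against the clean signal instead of averaging out as noise.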

Detection
  • Preference data statistical profiling: detect anomalous preference patterns (e.g., preferred responses systematically longer, more formatted, or on specific topics)
  • DPO gradient analysis: monitor gradient shift magnitude per training batch; detect batches with anomalous gradient norms (Yang et al. May 2026)
  • Held-out preference validation: compare model preference predictions against a clean held-out preference set
  • Reward model behavior analysis: probe the reward model for topic-specific biases or format preferences
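A minimal sketch of the first two checks above, using only the Python standard library. The pair layout (`chosen` / `rejected` keys), the character-length comparison, and the modified z-score threshold of 3.5 are assumptions made for illustration; the outlier rule is a generic median/MAD test, not the method from the cited gradient-analysis work.

```python
import statistics


def length_bias_rate(pairs):
    """Fraction of pairs whose preferred response is longer than the rejected one.

    A rate far from what a trusted reference set shows can point to
    systematic length or format bias in the preference data.
    """
    longer = sum(1 for p in pairs if len(p["chosen"]) > len(p["rejected"]))
    return longer / len(pairs)


def anomalous_batches(grad_norms, threshold=3.5):
    """Indices of batches whose gradient norm is a robust outlier.

    Uses the modified z-score 0.6745 * |x - median| / MAD, which is less
    distorted by the outliers themselves than a mean/stddev z-score.
    """
    med = statistics.median(grad_norms)
    mad = statistics.median([abs(g - med) for g in grad_norms])
    if mad == 0:
        return []  # all norms (nearly) identical; nothing to flag
    return [i for i, g in enumerate(grad_norms)
            if 0.6745 * abs(g - med) / mad > threshold]


# Hypothetical data for illustration only.
pairs = [
    {"chosen": "A long, carefully formatted answer.", "rejected": "Short."},
    {"chosen": "Another fairly long answer here.", "rejected": "No."},
    {"chosen": "Ok.", "rejected": "A longer rejected response."},
    {"chosen": "Detailed explanation with steps.", "rejected": "Nope."},
]
grad_norms = [1.0, 1.1, 0.9, 1.05, 0.95, 6.0, 1.0, 1.02]

print("length-bias rate:", length_bias_rate(pairs))
print("flagged batches:", anomalous_batches(grad_norms))
```

Flagged batches are candidates for manual review rather than proof of poisoning; benign causes such as unusually long sequences can also inflate gradient norms.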
Mitigation
  • Preference data provenance and integrity verification (HIGH)
  • Multiple independent preference datasets for cross-validation (HIGH)
  • DPO gradient anomaly detection (per-batch monitoring) (MEDIUM)
  • Robust aggregation methods (trimmed mean, Byzantine-tolerant) (MEDIUM)
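As one concrete instance of robust aggregation, the sketch below applies a trimmed mean to several preference scores for the same response pair. The scores, the number of annotators, and the 20% trim fraction are hypothetical; a real pipeline would tune the trim fraction to the share of sources it expects could be compromised.

```python
def trimmed_mean(scores, trim_fraction=0.2):
    """Mean of the scores after dropping the lowest and highest trim_fraction."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(scores)
    k = int(len(ordered) * trim_fraction)  # number dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical preference scores for one response pair from five annotators;
# the last annotator is compromised and submits an extreme opposing score.
scores = [0.9, 0.8, 0.85, 0.9, -1.0]

plain = sum(scores) / len(scores)
robust = trimmed_mean(scores, trim_fraction=0.2)

print(f"plain mean:   {plain:.2f}")
print(f"trimmed mean: {robust:.2f}")
```

Trimming bounds the influence of any single source, but it only helps while poisoned contributions stay below the trimmed share at each end.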
Chaining

Preference learning corruption is the most direct path to reward hacking (T6-AT-001) — a corrupted reward model *enables* reward hacking at deployment time. Format bias injection (T6-AP-007D) chains to T5-AT-001 (Parameter Manipulation) since format-biased models are more susceptible to output steering.
