T6-AT-007 HIGH

Preference Learning Corruption

Risk score: 230

Rating: High

Procedures: 10

Severity

Mechanism

Preference learning, which includes RLHF via reward modeling, Direct Preference Optimization (DPO), and Constitutional AI, is the primary mechanism by which LLMs are aligned to human values. PoisonBench (ICML 2025) demonstrated that poisoning just 1–5% of preference data produces log-linear degradation in safety and, critically, that the effect *does not diminish with model scale*: larger models are equally vulnerable. Best-of-Venom (2024) showed that injecting 1–5% poisoned preference pairs into HH-RLHF can manipulate the model's sentiment toward target entities.
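To make the mechanism concrete, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities and compares a clean pair with the same pair after its labels are swapped. The function name `dpo_pair_loss`, the log-probability values, and `beta=0.1` are illustrative assumptions, not taken from PoisonBench or Best-of-Venom.

```python
import math

def dpo_pair_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin).

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. All names and defaults here are illustrative.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    x = beta * margin
    # Numerically stable form of -log(sigmoid(x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Hypothetical log-probs: the policy already favors the safe response.
safe_pol, unsafe_pol = -10.0, -14.0
safe_ref, unsafe_ref = -12.0, -12.0

# Clean pair: the safe response is labeled "chosen".
clean = dpo_pair_loss(safe_pol, unsafe_pol, safe_ref, unsafe_ref)
# Poisoned pair: labels swapped, so the unsafe response is "chosen".
flipped = dpo_pair_loss(unsafe_pol, safe_pol, unsafe_ref, safe_ref)

print(f"clean={clean:.4f} flipped={flipped:.4f}")
```

Swapping the labels negates the margin, so the poisoned pair carries a higher loss for a well-behaved policy, and minimizing it pushes probability mass toward the unsafe response. This is only a single-pair illustration; it says nothing about how many flipped pairs are needed in practice.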

Detection

Preference data statistical profiling: detect anomalous preference patterns (e.g., preferred responses systematically longer, more formatted, or on specific topics)
DPO gradient analysis: monitor gradient shift magnitude per training batch; detect batches with anomalous gradient norms (Yang et al. May 2026)
Held-out preference validation: compare model preference predictions against a clean held-out preference set
Reward model behavior analysis: probe the reward model for topic-specific biases or format preferences
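Two of the checks above, statistical profiling and per-batch gradient monitoring, can be sketched as follows. This is a minimal illustration under stated assumptions: the `chosen`/`rejected` field names, the median/MAD outlier rule, and the 3.5 threshold are choices made for the example, not the method of any cited paper.

```python
from statistics import median

def robust_z_scores(values):
    """Median/MAD z-scores, which are less distorted by the outliers being sought."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores.
    return [0.6745 * (v - med) / mad for v in values]

def flag_anomalous_batches(grad_norms, threshold=3.5):
    """Return indices of batches whose gradient norm is a robust outlier."""
    scores = robust_z_scores(grad_norms)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]

def longer_chosen_rate(pairs):
    """Fraction of pairs whose preferred response is longer than the rejected one.

    Assumes each pair is a dict with 'chosen' and 'rejected' strings
    (a hypothetical schema for this sketch).
    """
    longer = sum(1 for p in pairs if len(p["chosen"]) > len(p["rejected"]))
    return longer / len(pairs)

# Hypothetical per-batch gradient norms logged during DPO training.
norms = [1.0, 1.1, 0.9, 1.05, 0.95, 9.0, 1.0]
suspicious = flag_anomalous_batches(norms)

# Hypothetical preference pairs.
sample = [
    {"chosen": "aaaa", "rejected": "aa"},
    {"chosen": "a", "rejected": "aaa"},
    {"chosen": "aaa", "rejected": "a"},
    {"chosen": "aaaaa", "rejected": "a"},
]
rate = longer_chosen_rate(sample)
```

A rate far from the value seen in a trusted baseline dataset, or a flagged batch, is a prompt for manual review of the underlying pairs rather than proof of poisoning.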

Mitigation

Preference data provenance and integrity verification (HIGH)

Multiple independent preference datasets for cross-validation (HIGH)

DPO gradient anomaly detection (per-batch monitoring) (MEDIUM)

Robust aggregation methods (trimmed mean, Byzantine-tolerant) (MEDIUM)
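As one illustration of robust aggregation, the sketch below combines preference margins from several independent sources with a trimmed mean, so that a single corrupted source cannot decide the label. The notion of a per-source "margin" (positive means response A is preferred), the function names, and the 20% trim fraction are assumptions made for the example.

```python
def trimmed_mean(values, trim_fraction=0.2):
    """Mean after dropping the lowest and highest trim_fraction of values."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def aggregate_preference(margins, trim_fraction=0.2):
    """Return 'A' if the trimmed-mean margin favors response A, else 'B'."""
    return "A" if trimmed_mean(margins, trim_fraction) > 0 else "B"

# Hypothetical margins from five independent sources; the last is corrupted.
margins = [0.4, 0.5, 0.3, 0.6, -9.0]

plain = sum(margins) / len(margins)   # dragged negative by the outlier
robust = trimmed_mean(margins)        # outlier is trimmed away
label = aggregate_preference(margins)
```

Trimming only bounds the influence of a minority of sources; if a large share of sources is compromised, the aggregate can still be steered.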

Chaining

Preference learning corruption is the most direct path to reward hacking (T6-AT-001) — a corrupted reward model *enables* reward hacking at deployment time. Format bias injection (T6-AP-007D) chains to T5-AT-001 (Parameter Manipulation) since format-biased models are more susceptible to output steering.
