Preference Learning Corruption
T6 · Training & Feedback Poisoning
Preference learning, which covers RLHF via reward modeling, Direct Preference Optimization (DPO), and Constitutional AI, is the primary mechanism by which LLMs are aligned to human values. PoisonBench (ICML 2025) demonstrated that poisoning just 1–5% of preference data produces log-linear degradation in safety and, critically, that the effect *does not diminish with model scale*: larger models are equally vulnerable. Best-of-Venom (2024) showed that injecting 1–5% poisoned preference pairs into HH-RLHF can manipulate the model's sentiment toward target entities.
- Preference data statistical profiling: detect anomalous preference patterns (e.g., preferred responses systematically longer, more formatted, or on specific topics)
- DPO gradient analysis: monitor gradient shift magnitude per training batch; detect batches with anomalous gradient norms (Yang et al. May 2026)
- Held-out preference validation: compare model preference predictions against a clean held-out preference set
- Reward model behavior analysis: probe the reward model for topic-specific biases or format preferences
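Three of the checks above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the method of any cited paper: the record fields `prompt`, `chosen`, and `rejected`, the helper names, and the `score_fn` callable (standing in for a reward model or policy scorer) are all hypothetical. The gradient check uses a generic modified z-score (median and MAD) with the conventional 3.5 cutoff, which is a swapped-in outlier test and not necessarily what Yang et al. use.

```python
from statistics import median


def length_bias_rate(pairs):
    """Fraction of pairs whose preferred response is longer than the rejected one.

    Assumes each pair is a dict with hypothetical 'chosen' and 'rejected' text fields.
    A rate far from that of a trusted reference set may indicate format or length bias.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    longer = sum(1 for p in pairs if len(p["chosen"]) > len(p["rejected"]))
    return longer / len(pairs)


def flag_anomalous_batches(grad_norms, threshold=3.5):
    """Return indices of batches whose gradient norm is a robust outlier.

    Uses the modified z-score 0.6745 * |x - median| / MAD, a generic outlier test
    chosen for illustration only.
    """
    med = median(grad_norms)
    mad = median(abs(g - med) for g in grad_norms)
    if mad == 0:
        # All typical values are identical, so any deviation at all is anomalous.
        return [i for i, g in enumerate(grad_norms) if g != med]
    return [
        i
        for i, g in enumerate(grad_norms)
        if 0.6745 * abs(g - med) / mad > threshold
    ]


def heldout_agreement(score_fn, heldout_pairs):
    """Fraction of clean held-out pairs where score_fn ranks 'chosen' above 'rejected'.

    score_fn(prompt, response) -> float is a hypothetical stand-in for the model
    under test; a drop in agreement after training suggests corrupted preferences.
    """
    if not heldout_pairs:
        raise ValueError("heldout_pairs must be non-empty")
    agree = sum(
        1
        for p in heldout_pairs
        if score_fn(p["prompt"], p["chosen"]) > score_fn(p["prompt"], p["rejected"])
    )
    return agree / len(heldout_pairs)
```

In practice the per-batch norms would be logged by the training loop, and the length-bias rate and held-out agreement would be compared against baselines measured on data known to be clean. The thresholds here are illustrative starting points and would need tuning per dataset.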
Preference learning corruption is the most direct path to reward hacking (T6-AT-001) — a corrupted reward model *enables* reward hacking at deployment time. Format bias injection (T6-AP-007D) chains to T5-AT-001 (Parameter Manipulation) since format-biased models are more susceptible to output steering.