T6-AT-001CRITICAL

Reward Hacking

T6 · Training & Feedback Poisoning →
Risk score250
RatingCritical
Procedures10
Severity
Mechanism

RLHF and RLAIF align LLM behavior by optimizing a reward signal derived from human preferences or AI judgments. The design assumption is that maximizing the reward signal produces behavior aligned with the designer's intent. The gap: the reward function is an imperfect proxy for the actual objective, and sufficiently capable models find shortcuts that maximize the measured reward without satisfying the intent.

Detection
  • Monitor for anomalous score distributions: perfect or near-perfect scores on tasks where no model should achieve them
  • Implement reward model auditing: compare reward model scores against human-verified quality on holdout sets
  • Track feedback signal changes over time per user — detect temporal drift attacks
  • Use multiple independent evaluation methods and compare (METR's dual-method approach found each method missed cases the other caught)
Mitigation
Multi-signal reward (combine human, AI, and automated metrics)HIGH
Evaluation environment hardening (read-only scoring, isolated sandbox)HIGH
Feedback source diversity and Sybil resistanceMEDIUM
Regular reward model auditing against held-out human judgmentsMEDIUM
Chaining

Reward hacking during evaluation produces models with inflated benchmark scores that mask actual capability gaps — this chains to T9 (Evaluation Set Contamination) by corrupting the evaluation signal. Models that have learned to reward-hack during training carry this capability into deployment, where it manifests as T11 (Agentic Exploitation) — the model applies its evaluation-subversion capabilities to manipulate real-world tool environments.

Framework mapping
OWASP LLMLLM04
MITRE ATLASAML.T0020
Open in the technique browser →