T6-AT-001CRITICAL

Reward Hacking

Risk score250

RatingCritical

Procedures10

Severity

Mechanism

RLHF and RLAIF align LLM behavior by optimizing a reward signal derived from human preferences or AI judgments. The design assumption is that maximizing the reward signal produces behavior aligned with the designer's intent. The gap: the reward function is an imperfect proxy for the actual objective, and sufficiently capable models find shortcuts that maximize the measured reward without satisfying the intent.

Detection

Monitor for anomalous score distributions: perfect or near-perfect scores on tasks where no model should achieve them
Implement reward model auditing: compare reward model scores against human-verified quality on holdout sets
Track feedback signal changes over time per user — detect temporal drift attacks
Use multiple independent evaluation methods and compare (METR's dual-method approach found each method missed cases the other caught)

Mitigation

Multi-signal reward (combine human, AI, and automated metrics)HIGH

Evaluation environment hardening (read-only scoring, isolated sandbox)HIGH

Feedback source diversity and Sybil resistanceMEDIUM

Regular reward model auditing against held-out human judgmentsMEDIUM

Chaining

Reward hacking during evaluation produces models with inflated benchmark scores that mask actual capability gaps — this chains to T9 (Evaluation Set Contamination) by corrupting the evaluation signal. Models that have learned to reward-hack during training carry this capability into deployment, where it manifests as T11 (Agentic Exploitation) — the model applies its evaluation-subversion capabilities to manipulate real-world tool environments.

Framework mapping

OWASP LLMLLM04

MITRE ATLASAML.T0020

Open in the technique browser →