T6-AT-011 (HIGH)

Reinforcement Signal Manipulation

Risk score: 240

Rating: High

Procedures: 10

Severity

Mechanism

Reinforcement signal manipulation targets the RL *process* rather than the RL *data* (which is covered by T6-AT-007 Preference Learning Corruption). The attack surface includes the reward model's inference-time behavior, the environment in which the model is evaluated, the value function estimates used for advantage calculation, and the policy gradient computation itself. Unlike data poisoning attacks, which require access during the pre-training phase, RL signal manipulation can occur during online training, while the model is actively learning from environmental feedback.
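To make the dependency between these signals concrete, the sketch below computes advantages from rewards and value estimates and then perturbs each input in turn. It assumes generalized advantage estimation (GAE) as the advantage estimator, which the entry does not specify; the reward and value numbers, and the two tampering scenarios, are hypothetical.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for a single episode.

    `values` must hold one more entry than `rewards`: the last entry is
    the bootstrap value of the state reached after the final step.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted tail sum.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical clean episode.
clean_rewards = [0.0, 0.0, 1.0]
clean_values = [0.5, 0.6, 0.8, 0.0]
clean_adv = gae_advantages(clean_rewards, clean_values)

# Hypothetical tampering 1: inflate the reward on the first step only.
tampered_adv = gae_advantages([0.5, 0.0, 1.0], clean_values)

# Hypothetical tampering 2: bias a single value estimate.
biased_adv = gae_advantages(clean_rewards, [0.5, 0.6, 0.3, 0.0])

print("clean   ", clean_adv)
print("reward+ ", tampered_adv)
print("value-  ", biased_adv)
```

In this toy episode the inflated reward shifts only the advantage of the step it was applied to, while the biased value estimate changes the advantage at its own step and at every earlier step. This is one reason value estimates are listed as an attack surface in their own right.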

Detection

Reward model behavior monitoring: track reward distribution across training batches for anomalous shifts
Environment integrity verification: hash and version-control all training environment configurations
Policy gradient statistical analysis: detect anomalous gradient components across distributed training nodes
Value function consistency checks: compare value estimates against actual observed returns
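Three of the checks above can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not a reference implementation: the names `config_fingerprint`, `RewardShiftMonitor`, and `value_return_gap` are hypothetical, the z-score threshold and window size are arbitrary starting points, and the environment configuration is assumed to be JSON-serializable.

```python
import hashlib
import json
from collections import deque

import numpy as np

def config_fingerprint(config: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class RewardShiftMonitor:
    """Flags a batch whose mean reward is far from the recent baseline."""

    def __init__(self, window=50, z_threshold=4.0, min_history=10):
        self.history = deque(maxlen=window)  # recent per-batch mean rewards
        self.z_threshold = z_threshold
        self.min_history = min_history

    def observe(self, batch_rewards) -> bool:
        batch_mean = float(np.mean(batch_rewards))
        flagged = False
        if len(self.history) >= self.min_history:
            baseline = np.array(self.history)
            spread = baseline.std() + 1e-8  # avoid division by zero
            z = abs(batch_mean - baseline.mean()) / spread
            flagged = bool(z > self.z_threshold)
        if not flagged:
            # Keep flagged batches out of the baseline so a sustained
            # shift cannot quietly become the new normal.
            self.history.append(batch_mean)
        return flagged

def value_return_gap(value_estimates, observed_returns) -> float:
    """Mean absolute gap between predicted values and observed returns."""
    v = np.asarray(value_estimates, dtype=float)
    g = np.asarray(observed_returns, dtype=float)
    return float(np.mean(np.abs(v - g)))

# Usage with hypothetical data.
approved = config_fingerprint({"env": "gridworld", "max_steps": 200})
current = config_fingerprint({"max_steps": 200, "env": "gridworld"})
modified = config_fingerprint({"env": "gridworld", "max_steps": 2000})

monitor = RewardShiftMonitor()
steady_flags = [monitor.observe(np.full(8, 0.5 + 0.01 * (-1) ** i)) for i in range(20)]
shift_flag = monitor.observe(np.full(8, 0.9))
gap = value_return_gap([1.0, 2.0], [1.0, 4.0])
```

The monitor compares batch means only, so it will miss manipulation that preserves the mean while changing the shape of the reward distribution. A fuller check would compare whole distributions. The gradient analysis item is not covered here because it depends on the distributed training setup.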

Mitigation

Reward model ensembling from independent sources (HIGH)

Gradient clipping and norm bounding per training step (MEDIUM)

Environment sandboxing with integrity attestation (HIGH)

Value function calibration against held-out trajectories (MEDIUM)
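As a rough illustration of the first two mitigations, the sketch below aggregates scores from several reward models with a median and bounds the global gradient norm. The function names, the `max_spread` threshold, and the sample scores are hypothetical, and the median is only one possible aggregation rule; the entry does not prescribe one.

```python
import numpy as np

def ensemble_reward(scores, max_spread=0.5):
    """Combine per-model reward scores of shape (n_models, batch_size).

    Returns the per-sample median and a boolean mask marking samples
    where the models disagree by more than `max_spread`.
    """
    scores = np.asarray(scores, dtype=float)
    median = np.median(scores, axis=0)
    spread = scores.max(axis=0) - scores.min(axis=0)
    return median, spread > max_spread

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly scaled) gradients and the pre-clipping norm,
    which is worth logging as a detection signal.
    """
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

# Usage with hypothetical scores: the third model is an outlier on sample 0.
scores = [[0.10, 0.20],
          [0.12, 0.25],
          [0.90, 0.22]]
rewards, disagreement = ensemble_reward(scores)

clipped, raw_norm = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
```

A median limits the influence of any single compromised reward model as long as a majority of the ensemble is intact, which is why independence of the sources matters. The disagreement mask gives a place to route suspicious samples for review instead of training on them.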

Chaining

Reinforcement signal manipulation directly enables T6-AT-001 (Reward Hacking): a corrupted reward signal *creates* the conditions for reward hacking. Environment manipulation (T6-AP-011B) chains to T11 (Agentic Exploitation) because agent environments are themselves the attack surface.
