Reinforcement Signal Manipulation
Reinforcement signal manipulation targets the RL *process* rather than the RL *data* (which is covered by T6-AT-007 Preference Learning Corruption). The attack surface includes the reward model's inference-time behavior, the environment in which the model is evaluated, the value function estimates used for advantage calculation, and the policy gradient computation itself. Unlike data poisoning attacks, which require pre-training-phase access, RL signal manipulation can occur during online training, while the model is actively learning from environmental feedback.
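To make the dependence on the reward and value signals concrete, the following is a minimal sketch of a one-step temporal-difference advantage, one common form of advantage estimate. The function name `td_advantage`, the discount `gamma=0.99`, and all numeric values are illustrative assumptions, not taken from any particular training stack.

```python
def td_advantage(reward: float, value: float, next_value: float,
                 gamma: float = 0.99) -> float:
    """One-step TD advantage: A = r + gamma * V(s') - V(s)."""
    return reward + gamma * next_value - value


# Untampered signals (hypothetical numbers): the action looks worse than expected.
clean = td_advantage(reward=0.0, value=0.5, next_value=0.4)

# Same transition with an inflated reward signal.
reward_tampered = td_advantage(reward=0.5, value=0.5, next_value=0.4)

# Same transition with a deflated value estimate for the current state.
value_tampered = td_advantage(reward=0.0, value=0.1, next_value=0.4)

print(clean, reward_tampered, value_tampered)
```

In this toy case the clean advantage is negative while both tampered variants are positive, so a policy gradient update weighted by the advantage would reinforce the action instead of discouraging it, without any change to the training data.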
- Reward model behavior monitoring: track reward distribution across training batches for anomalous shifts
- Environment integrity verification: hash and version-control all training environment configurations
- Policy gradient statistical analysis: detect anomalous gradient components across distributed training nodes
- Value function consistency checks: compare value estimates against actual observed returns
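Three of the checks above can be sketched with the Python standard library alone. This is an illustrative outline under stated assumptions, not a production monitor: the function names, the choice of a z-score against a trusted baseline window, and the mean-absolute-gap metric are all hypothetical choices made for the example. Gradient analysis across distributed nodes is omitted because it depends on the training framework in use.

```python
import hashlib
import json
import statistics
from typing import List, Sequence


def config_fingerprint(config: dict) -> str:
    """SHA-256 over a canonical JSON encoding of an environment config.

    Sorting keys makes the hash independent of key order, so only a real
    change to the configuration changes the fingerprint.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reward_shift_zscore(baseline: Sequence[float], batch: Sequence[float]) -> float:
    """Z-score of the batch mean reward against a trusted baseline window."""
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)  # sample standard deviation
    standard_error = sigma / len(batch) ** 0.5
    return (statistics.fmean(batch) - mu) / standard_error


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Observed return G_t = r_t + gamma * G_{t+1} for one finished episode."""
    returns: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def value_gap(values: Sequence[float], rewards: Sequence[float], gamma: float) -> float:
    """Mean absolute gap between value estimates and observed returns."""
    returns = discounted_returns(rewards, gamma)
    return statistics.fmean(abs(v - g) for v, g in zip(values, returns))
```

A possible use: record `config_fingerprint(...)` for each approved environment configuration and refuse to train when the live fingerprint differs; alert when `abs(reward_shift_zscore(...))` or `value_gap(...)` exceeds a threshold tuned on known-good runs. The thresholds themselves are deployment-specific and are not suggested here.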
Reinforcement signal manipulation directly enables T6-AT-001 (Reward Hacking): a corrupted reward signal *creates* the conditions for reward hacking. Environment manipulation (T6-AP-011B) chains to T11 (Agentic Exploitation) because agent environments are the attack surface.