Review Gaming Through A/B Testing
T15 · Human Workflow Exploitation
Safety controls are not deployed uniformly: they are A/B tested, canaried, feature-flagged, and rolled out gradually, so at any moment some cohorts run weaker or different guardrails than others. Review Gaming exploits this heterogeneity in two ways. First, it *sorts into weakness*: the attacker detects which experiment arm, canary, or rollout bucket has lighter controls and maneuvers accounts or requests into it (or onto a feature flag that disables a check). Second, it games the experiment's decision metric: coordinated feedback makes an arm appear to win on signal that is low-trust or manufactured.
- Per-arm abuse-rate monitoring: Track harmful-content and bypass rates by experiment arm, canary, and rollout bucket; an arm with anomalous abuse concentration signals weak-arm targeting.
- Cohort-assignment integrity checks: Detect non-random clustering of suspicious accounts into specific arms or flag states (assignment gaming).
- Metric-manipulation detection: Apply the same coordination/Sybil analytics used against feedback poisoning to experiment metrics; flag arms whose "wins" are driven by low-trust or coordinated signal.
- Rollout-state consistency monitoring: Alert on requests that hit inconsistent or partially-applied control states during rollouts/rollbacks.
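
As a rough illustration of the first two checks above (per-arm abuse-rate monitoring and cohort-assignment integrity), the sketch below compares each arm against all other arms pooled, using a two-proportion z-score. Everything here is an assumption made for illustration: the `ArmStats` record, its field names, the `flag_arms` helper, and the threshold of 3.0 are hypothetical, and a production system would likely need sequential-testing corrections and per-segment baselines.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmStats:
    """Hypothetical per-arm counters for one monitoring window."""
    arm: str
    requests: int             # total requests served by this arm
    abusive: int              # requests labeled harmful or bypassing a control
    accounts: int             # all accounts assigned to this arm
    suspicious_accounts: int  # low-trust accounts assigned to this arm


def _z_vs_rest(hits: int, n: int, total_hits: int, total_n: int) -> float:
    """Two-proportion z-score of one arm against all other arms pooled."""
    rest_hits, rest_n = total_hits - hits, total_n - n
    if n == 0 or rest_n == 0:
        return 0.0
    pooled = total_hits / total_n
    se = math.sqrt(pooled * (1 - pooled) * (1 / n + 1 / rest_n))
    if se == 0:
        return 0.0
    return (hits / n - rest_hits / rest_n) / se


def flag_arms(stats: list[ArmStats], z_threshold: float = 3.0) -> dict[str, list[str]]:
    """Return arms whose abuse rate or suspicious-account share is anomalously high."""
    total_req = sum(s.requests for s in stats)
    total_abuse = sum(s.abusive for s in stats)
    total_acct = sum(s.accounts for s in stats)
    total_susp = sum(s.suspicious_accounts for s in stats)

    flags: dict[str, list[str]] = {}
    for s in stats:
        reasons = []
        # Abuse concentrated in one arm suggests weak-arm targeting.
        if _z_vs_rest(s.abusive, s.requests, total_abuse, total_req) > z_threshold:
            reasons.append("abuse_concentration")
        # Random assignment should spread low-trust accounts evenly across arms.
        if _z_vs_rest(s.suspicious_accounts, s.accounts, total_susp, total_acct) > z_threshold:
            reasons.append("assignment_clustering")
        if reasons:
            flags[s.arm] = reasons
    return flags
```

Only one-sided excess is flagged, since an arm with unusually low abuse is not evidence of targeting. An arm flagged for both reasons at once is a stronger signal than either alone: suspicious accounts have clustered there and abuse is landing there.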
This technique shares its metric-gaming core with T15-AT-003 (T15-AP-003G) — coordinated feedback is the tool that moves an experiment's decision metric — and its transient-window exploitation with T15-AT-012 (rollback/flag windows are timing seams). Sorting into weak cohorts (T15-AP-014A/T15-AP-014D) is a delivery mechanism for any payload, letting model-side attacks (T1/T2/T11) run against the arm least able to stop them.