T15-AT-014 · HIGH

Review Gaming Through A/B Testing

T15 · Human Workflow Exploitation
Risk score: 215
Rating: High
Procedures: 9
Mechanism

Safety controls are not deployed uniformly: they are A/B tested, canaried, feature-flagged, and rolled out gradually, so at any moment some cohorts run weaker or different guardrails than others. Review Gaming exploits this heterogeneity in two ways. First, it *sorts into weakness*: the attacker detects which experiment arm, canary, or rollout bucket has lighter controls and maneuvers accounts or requests into it (or onto a feature flag that disables a check). Second, it games the experiment's decision metric: coordinated feedback moves the metric so that a weaker arm registers as the "win".

Detection
  • Per-arm abuse-rate monitoring: Track harmful-content and bypass rates by experiment arm, canary, and rollout bucket; an arm with anomalous abuse concentration signals weak-arm targeting.
  • Cohort-assignment integrity checks: Detect non-random clustering of suspicious accounts into specific arms or flag states (assignment gaming).
  • Metric-manipulation detection: Apply the same coordination/sybil analytics as feedback poisoning to experiment metrics; flag arms whose "wins" are driven by low-trust or coordinated signal.
  • Rollout-state consistency monitoring: Alert on requests that hit inconsistent or partially-applied control states during rollouts/rollbacks.
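As a rough illustration of per-arm abuse-rate monitoring, the sketch below compares each arm's flagged-request rate against all other arms pooled, using a two-proportion z-test. The function names, the arm names, the counts, and the threshold of 3.0 are all hypothetical choices made for the example; a production monitor would likely also correct for multiple comparisons and for legitimate traffic differences between arms.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical input shape: arm name -> (flagged_requests, total_requests).
ArmCounts = Dict[str, Tuple[int, int]]


def arm_abuse_zscores(counts: ArmCounts) -> Dict[str, float]:
    """Z-score of each arm's flagged rate versus all other arms pooled."""
    total_flagged = sum(f for f, _ in counts.values())
    total_n = sum(n for _, n in counts.values())
    scores: Dict[str, float] = {}
    for arm, (flagged, n) in counts.items():
        rest_flagged = total_flagged - flagged
        rest_n = total_n - n
        if n == 0 or rest_n == 0:
            scores[arm] = 0.0  # nothing to compare against
            continue
        p_arm = flagged / n
        p_rest = rest_flagged / rest_n
        p_pool = total_flagged / total_n
        # Standard error of the difference under the pooled null hypothesis.
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n + 1 / rest_n))
        scores[arm] = 0.0 if se == 0 else (p_arm - p_rest) / se
    return scores


def flag_arms(counts: ArmCounts, z_threshold: float = 3.0) -> List[str]:
    """Arms whose abuse concentration is anomalously high (assumed threshold)."""
    return sorted(a for a, z in arm_abuse_zscores(counts).items() if z >= z_threshold)


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    example = {
        "control": (50, 10_000),
        "treatment_a": (55, 10_000),
        "canary": (60, 2_000),
    }
    print(flag_arms(example))
```

The same comparison can be pointed at cohort-assignment integrity by passing, per arm, the number of suspicious accounts and the total accounts assigned, since non-random clustering shows up as an arm whose share of suspicious accounts is far above the rest.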
Mitigation
Never weaken safety floors in experiments · HIGH
Server-side, opaque, tamper-resistant assignment · HIGH
Abuse-resistant experiment metrics · HIGH
Consistent control state across rollout/rollback · MEDIUM
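One way the "server-side, opaque, tamper-resistant assignment" mitigation might look in practice is a keyed hash over a stable, server-held account identifier. The sketch below is an assumption-laden illustration, not a description of any particular experimentation platform: `assign_arm`, the key handling, and the arm weights are hypothetical. Because the bucket depends on a secret key, a client cannot predict or choose its arm by picking identifiers, and re-keying per experiment keeps assignments uncorrelated across experiments.

```python
import hashlib
import hmac
from typing import Sequence, Tuple


def assign_arm(
    secret_key: bytes,
    experiment_id: str,
    account_id: str,
    arms: Sequence[Tuple[str, int]],
) -> str:
    """Deterministically map an account to an arm using HMAC-SHA256.

    `arms` is a sequence of (arm_name, integer_weight). `account_id` is
    assumed to be a server-side identifier the client cannot set or rotate.
    """
    total_weight = sum(weight for _, weight in arms)
    if total_weight <= 0:
        raise ValueError("arm weights must sum to a positive integer")
    message = f"{experiment_id}:{account_id}".encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).digest()
    # Use the first 8 bytes as an integer; modulo bias is negligible here.
    bucket = int.from_bytes(digest[:8], "big") % total_weight
    cumulative = 0
    for name, weight in arms:
        cumulative += weight
        if bucket < cumulative:
            return name
    raise AssertionError("unreachable: bucket is always below total_weight")


if __name__ == "__main__":
    key = b"example-key-loaded-from-a-secret-store"  # placeholder value
    arms = [("control", 90), ("treatment", 10)]
    print(assign_arm(key, "exp-guardrail-v2", "account-123", arms))
```

The assignment is computed and kept on the server; exposing the arm name in responses, cookies, or client-visible flags would give back the signal this design is meant to hide.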
Chaining

This technique shares its metric-gaming core with T15-AT-003 (T15-AP-003G) — coordinated feedback is the tool that moves an experiment's decision metric — and its transient-window exploitation with T15-AT-012 (rollback/flag windows are timing seams). Sorting into weak cohorts (T15-AP-014A/T15-AP-014D) is a delivery mechanism for any payload, letting model-side attacks (T1/T2/T11) run against the arm least able to stop them.

Framework mapping
OWASP LLM: LLM09