T15-AT-014 (HIGH)

Review Gaming Through A/B Testing

Risk score: 215

Rating: High

Procedures: 9

Severity

Mechanism

Safety controls are not deployed uniformly: they are A/B tested, canaried, feature-flagged, and rolled out gradually, so at any moment some cohorts run weaker or different guardrails than others. Review Gaming exploits this heterogeneity in two ways. First, it *sorts into weakness*: the attacker detects which experiment arm, canary, or rollout bucket has lighter controls and maneuvers accounts or requests into it (or onto a feature flag that disables a check). Second, it games the experiment's decision metric: coordinated or low-trust feedback is used to move the metric that determines which arm wins.

Detection

Per-arm abuse-rate monitoring: Track harmful-content and bypass rates by experiment arm, canary, and rollout bucket; an arm with anomalous abuse concentration signals weak-arm targeting.
Cohort-assignment integrity checks: Detect non-random clustering of suspicious accounts into specific arms or flag states (assignment gaming).
Metric-manipulation detection: Apply the same coordination/sybil analytics as feedback poisoning to experiment metrics; flag arms whose "wins" are driven by low-trust or coordinated signal.
Rollout-state consistency monitoring: Alert on requests that hit inconsistent or partially-applied control states during rollouts/rollbacks.
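The first two checks above can be sketched with basic statistics. The following is a minimal illustration, not a production detector: the function names, the event shape `(arm, is_abusive)`, and the choice of a two-proportion z-score and a chi-square statistic are all assumptions made for this example.

```python
import math
from collections import Counter


def arm_abuse_zscores(events):
    """events: iterable of (arm, is_abusive) pairs.

    Returns {arm: z}, comparing each arm's abuse rate with the pooled
    rate of all other arms (two-proportion z-score).
    """
    totals, abusive = Counter(), Counter()
    for arm, is_abusive in events:
        totals[arm] += 1
        abusive[arm] += int(is_abusive)
    n_all, a_all = sum(totals.values()), sum(abusive.values())
    scores = {}
    for arm in totals:
        n1, a1 = totals[arm], abusive[arm]
        n2, a2 = n_all - n1, a_all - a1
        if n2 == 0:
            continue  # only one arm present: nothing to compare against
        pooled = a_all / n_all
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        scores[arm] = 0.0 if se == 0 else (a1 / n1 - a2 / n2) / se
    return scores


def assignment_chi_square(suspicious_arms, expected_share):
    """suspicious_arms: one arm name per suspicious account.
    expected_share: {arm: intended allocation fraction}, summing to 1.

    Returns the chi-square statistic of observed vs. intended allocation;
    a large value suggests suspicious accounts are not randomly spread.
    """
    n = len(suspicious_arms)
    if n == 0:
        return 0.0
    observed = Counter(suspicious_arms)
    return sum(
        (observed[arm] - n * share) ** 2 / (n * share)
        for arm, share in expected_share.items()
        if share > 0
    )
```

A large positive z-score for one arm points at abuse concentrating there, and a large chi-square statistic points at suspicious accounts clustering away from the intended split. Alert thresholds are deployment-specific and are deliberately left to the caller.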

Mitigation

Never weaken safety floors in experiments (HIGH)
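One way to enforce this is to validate every experiment definition against a fixed floor before it can launch. The sketch below assumes a hypothetical representation in which each control has an integer strictness level (higher is stricter); the control names and the `SAFETY_FLOOR` values are illustrative only.

```python
# Hypothetical floor: minimum strictness level per control (higher = stricter).
SAFETY_FLOOR = {"content_filter": 2, "jailbreak_classifier": 1, "rate_limit": 1}


def floor_violations(arm_config, floor=SAFETY_FLOOR):
    """Return the controls an arm sets below the floor.

    A control missing from the arm's config counts as level 0 (disabled).
    """
    return sorted(
        name for name, minimum in floor.items()
        if arm_config.get(name, 0) < minimum
    )


def validate_experiment(arms):
    """arms: {arm_name: {control_name: level}}. Raises if any arm is below the floor."""
    bad = {}
    for arm, config in arms.items():
        violations = floor_violations(config)
        if violations:
            bad[arm] = violations
    if bad:
        raise ValueError(f"arms below safety floor: {bad}")
```

Under this design an arm may strengthen a control or change something unrelated, but an arm that lowers or omits a floored control is rejected at definition time rather than discovered in production.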

Server-side, opaque, tamper-resistant assignment (HIGH)
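A common way to get these properties is keyed hashing: the server derives the arm from a secret and a stable account identifier, so the client cannot predict or choose its bucket. The sketch below is an assumption about how such assignment could look; `assign_arm`, its parameters, and the `experiment:account_id` message layout are hypothetical.

```python
import hashlib
import hmac


def assign_arm(secret, experiment, account_id, arms):
    """Deterministic server-side bucketing.

    secret: bytes known only to the server.
    arms: list of (name, weight) pairs whose weights sum to 1.
    """
    msg = f"{experiment}:{account_id}".encode()
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    # Keep the top 53 bits so the division is exact and lands in [0, 1).
    point = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    cumulative = 0.0
    for name, weight in arms:
        cumulative += weight
        if point < cumulative:
            return name
    return arms[-1][0]  # guard against weights that sum to slightly under 1
```

Because the result depends only on the secret, the experiment name, and the account identifier, client-supplied fields such as headers or cookies cannot move a request into another arm. This does not stop an attacker from creating many accounts and observing behavior, which is why the arm should also not be exposed in responses.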

Abuse-resistant experiment metrics (HIGH)
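As an illustration of what "abuse-resistant" could mean in practice, the sketch below computes a positive-feedback rate that ignores low-trust accounts, caps how many signals one account can contribute, and weights the remainder by trust. The trust score, the `min_trust` and `per_account_cap` defaults, and the signal shape are assumptions for this example, not values taken from the technique entry.

```python
def trust_weighted_rate(signals, min_trust=0.2, per_account_cap=3):
    """signals: iterable of (account_id, trust, positive), trust in [0, 1].

    Returns the trust-weighted share of positive signals, or None if no
    signal survives filtering.
    """
    counted = {}
    numerator = denominator = 0.0
    for account_id, trust, positive in signals:
        if trust < min_trust:
            continue  # drop low-trust (likely sybil) signal entirely
        counted[account_id] = counted.get(account_id, 0) + 1
        if counted[account_id] > per_account_cap:
            continue  # one account cannot flood the metric
        numerator += trust if positive else 0.0
        denominator += trust
    return numerator / denominator if denominator else None
```

With ten trusted accounts split evenly and one hundred low-trust accounts all voting positive, the naive rate is pulled toward 1 while this metric stays at the trusted accounts' 0.5.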

Consistent control state across rollout/rollback (MEDIUM)
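One possible approach is to publish control configuration as immutable, versioned snapshots and have each request pin a single snapshot for its whole lifetime, so no request sees a half-applied rollout or rollback. The classes below are a hypothetical in-process sketch; a real deployment would also need to coordinate versions across services.

```python
import threading
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class ControlSnapshot:
    version: int
    controls: MappingProxyType  # read-only view of {control_name: level}


class ControlStore:
    def __init__(self, controls):
        self._lock = threading.Lock()
        self._current = ControlSnapshot(1, MappingProxyType(dict(controls)))

    def pin(self):
        """Called once per request; every check then reads from this snapshot."""
        return self._current

    def publish(self, controls):
        """Build the complete new state first, then swap it in as one reference."""
        with self._lock:
            snapshot = ControlSnapshot(
                self._current.version + 1, MappingProxyType(dict(controls))
            )
            self._current = snapshot
            return snapshot
```

A request that pinned version 1 keeps evaluating against version 1 even if version 2 is published mid-request, and logging the pinned version gives rollout-state monitoring a concrete field to alert on.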

Chaining

This technique shares its metric-gaming core with T15-AT-003 (T15-AP-003G) — coordinated feedback is the tool that moves an experiment's decision metric — and its transient-window exploitation with T15-AT-012 (rollback/flag windows are timing seams). Sorting into weak cohorts (T15-AP-014A/T15-AP-014D) is a delivery mechanism for any payload, letting model-side attacks (T1/T2/T11) run against the arm least able to stop them.

Framework mapping

OWASP LLM: LLM09
