T3-AT-012HIGH

Capability Testing

T3 · Reasoning & Constraint Exploitation →

Risk score200

RatingHigh

Procedures10

Severity

Mechanism

When a request is framed as an evaluation, benchmark, or test, the model processes it through a meta-level that can override content-level safety. " The architectural vulnerability is a privilege confusion: the model treats evaluation-framed requests as if they carry evaluator authority, similar to how it treats system prompts differently from user prompts. This is distinct from Reverse Psychology (T3-AT-010) which uses emotional provocation — Capability Testing uses *assessment framing* that signals the request comes from the model's evaluation context rather than normal user interaction.

Detection

Evaluation framing keywords: "test," "benchmark," "evaluate," "assess," "performance metric," "quality check," "red team test," "stress test," "system test"
Explicit bypass instructions: "bypass your safety," "without restrictions," "ignore your guidelines"
Meta-safety framing: "evaluating AI safety," "testing your safety," "safety evaluation"
Combined evaluation framing + restricted content as high-confidence adversarial signal

Mitigation

Evaluation-frame-immune safetyHIGH

Explicit bypass detectionHIGH

Meta-evaluation dismissalHIGH

Content-level evaluation onlyHIGH

Chaining

Capability testing establishes an evaluation context that can persist across turns. If the model partially complies with a moderate-content capability test, subsequent tests can escalate through T3-AT-014 (Incremental Boundary Pushing) within the established evaluation frame.

Framework mapping

OWASP LLMLLM01

MITRE ATLASAML.T0054

Open in the technique browser →