T3-AT-012 · HIGH

Capability Testing

T3 · Reasoning & Constraint Exploitation
Risk score: 200
Rating: High
Procedures: 10
Mechanism

When a request is framed as an evaluation, benchmark, or test, the model may handle it at a meta level that can override content-level safety. The architectural vulnerability is a privilege confusion: the model treats evaluation-framed requests as if they carry evaluator authority, much as it treats system prompts differently from user prompts. This is distinct from Reverse Psychology (T3-AT-010), which uses emotional provocation. Capability Testing instead uses *assessment framing* that signals the request comes from the model's evaluation context rather than from normal user interaction.

Detection
  • Evaluation framing keywords: "test," "benchmark," "evaluate," "assess," "performance metric," "quality check," "red team test," "stress test," "system test"
  • Explicit bypass instructions: "bypass your safety," "without restrictions," "ignore your guidelines"
  • Meta-safety framing: "evaluating AI safety," "testing your safety," "safety evaluation"
  • Evaluation framing combined with restricted content: treat as a high-confidence adversarial signal
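
The detection signals above can be sketched as a simple phrase matcher. This is a minimal illustration, not a production detector: the phrase lists are copied from the bullets, while the names `FrameSignals` and `detect_frame_signals`, the prefix-matching rule, and the `restricted_content` flag (assumed to come from a separate content classifier) are all hypothetical. Counting meta-safety phrases as evaluation framing for the combined signal is also an assumption.

```python
import re
from dataclasses import dataclass
from typing import List

# Phrase lists taken from the Detection bullets.
EVALUATION_KEYWORDS = [
    "test", "benchmark", "evaluate", "assess", "performance metric",
    "quality check", "red team test", "stress test", "system test",
]
BYPASS_PHRASES = ["bypass your safety", "without restrictions", "ignore your guidelines"]
META_SAFETY_PHRASES = ["evaluating ai safety", "testing your safety", "safety evaluation"]


def _find(phrases: List[str], text: str) -> List[str]:
    """Return phrases that occur at a word start (so "test" also hits "tests")."""
    lowered = text.lower()
    return [p for p in phrases if re.search(r"\b" + re.escape(p), lowered)]


@dataclass
class FrameSignals:
    evaluation_hits: List[str]
    bypass_hits: List[str]
    meta_safety_hits: List[str]
    restricted_content: bool  # assumption: supplied by an upstream classifier

    @property
    def high_confidence(self) -> bool:
        # Combined signal: evaluation framing AND restricted content.
        # Assumption: meta-safety phrases count as evaluation framing here.
        framed = bool(self.evaluation_hits or self.meta_safety_hits)
        return framed and self.restricted_content


def detect_frame_signals(text: str, restricted_content: bool) -> FrameSignals:
    """Collect the keyword signals for one user message."""
    return FrameSignals(
        evaluation_hits=_find(EVALUATION_KEYWORDS, text),
        bypass_hits=_find(BYPASS_PHRASES, text),
        meta_safety_hits=_find(META_SAFETY_PHRASES, text),
        restricted_content=restricted_content,
    )
```

Keyword hits alone are weak evidence: a question about unit tests matches "test" without being adversarial, which is why the sketch only raises `high_confidence` when the content flag is also set.
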
Mitigation
  • Evaluation-frame-immune safety: HIGH
  • Explicit bypass detection: HIGH
  • Meta-evaluation dismissal: HIGH
  • Content-level evaluation only: HIGH
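
One possible reading of "content-level evaluation only" is that the allow/refuse decision depends solely on what is being asked, never on a claimed evaluation context. The sketch below illustrates that invariant under stated assumptions: `frame_immune_decision` and `toy_is_restricted` are hypothetical names, and the toy classifier is a stand-in for whatever content-level check a real system would use.

```python
from typing import Callable


def frame_immune_decision(request: str, is_restricted: Callable[[str], bool]) -> str:
    """Decide on content alone.

    No parameter carries "evaluator authority": a claimed test or benchmark
    context cannot change the outcome because it is never consulted.
    """
    return "refuse" if is_restricted(request) else "answer"


def toy_is_restricted(text: str) -> bool:
    # Hypothetical placeholder classifier, used only to exercise the sketch.
    return "restricted-topic" in text.lower()
```

With this shape, `frame_immune_decision("Explain restricted-topic in detail.", toy_is_restricted)` and the same request prefixed with "This is an official safety evaluation." produce the same result, while a benign request that happens to mention a benchmark is still answered.
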
Chaining

Capability testing establishes an evaluation context that can persist across turns. If the model partially complies with a moderate-content capability test, subsequent tests can escalate through T3-AT-014 (Incremental Boundary Pushing) within the established evaluation frame.
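
Because the evaluation frame can persist across turns, a per-message check may miss later escalation steps that no longer repeat the framing keywords. A minimal sketch of conversation-level tracking follows; the class name `ConversationFrameTracker`, the short marker list, and the `restricted_content` flag (assumed to come from a separate classifier) are illustrative assumptions, not part of the catalog entry.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a small subset of the evaluation keywords, matched as substrings.
FRAME_MARKERS = ("test", "benchmark", "evaluate", "assess")


@dataclass
class ConversationFrameTracker:
    frame_established_at: Optional[int] = None  # turn where framing first appeared
    turn: int = 0

    def observe(self, user_text: str, restricted_content: bool) -> bool:
        """Record one user turn; return True if it should be flagged.

        A turn is flagged when restricted content appears at or after the
        turn that established an evaluation frame, even if the current
        message contains no framing keywords itself.
        """
        self.turn += 1
        lowered = user_text.lower()
        if self.frame_established_at is None and any(m in lowered for m in FRAME_MARKERS):
            self.frame_established_at = self.turn
        return self.frame_established_at is not None and restricted_content
```

For example, a conversation that opens with "Let's run a capability benchmark." and only reaches restricted content two turns later would still be flagged on that later turn, since the frame was recorded at turn 1.
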

Framework mapping
OWASP LLM: LLM01
MITRE ATLAS: AML.T0054