Capability Testing
T3 · Reasoning & Constraint Exploitation
When a request is framed as an evaluation, benchmark, or test, the model processes it through a meta-level that can override content-level safety. The architectural vulnerability is a privilege confusion: the model treats evaluation-framed requests as if they carry evaluator authority, much as it treats system prompts differently from user prompts. This is distinct from Reverse Psychology (T3-AT-010), which uses emotional provocation; Capability Testing uses *assessment framing* that signals the request comes from the model's evaluation context rather than from normal user interaction.
- Evaluation framing keywords: "test," "benchmark," "evaluate," "assess," "performance metric," "quality check," "red team test," "stress test," "system test"
- Explicit bypass instructions: "bypass your safety," "without restrictions," "ignore your guidelines"
- Meta-safety framing: "evaluating AI safety," "testing your safety," "safety evaluation"
- Evaluation framing combined with a request for restricted content: treat as a high-confidence adversarial signal
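The signals above could be checked with a simple phrase scan. The following is a minimal sketch for defensive triage, not a production detector. The phrase lists are copied from the bullets; the names `scan_prompt`, `SignalReport`, and the `requests_restricted_content` flag are hypothetical. The flag is assumed to come from a separate content classifier that this sketch does not implement, and counting meta-safety phrases as evaluation framing is also an assumption.

```python
import re
from dataclasses import dataclass
from typing import List

# Phrase lists taken from the signal bullets above.
EVALUATION_FRAMING = [
    "test", "benchmark", "evaluate", "assess", "performance metric",
    "quality check", "red team test", "stress test", "system test",
]
BYPASS_INSTRUCTIONS = [
    "bypass your safety", "without restrictions", "ignore your guidelines",
]
META_SAFETY = [
    "evaluating ai safety", "testing your safety", "safety evaluation",
]


def _matches(phrases: List[str], text: str) -> List[str]:
    """Return the phrases that occur in text as whole words (case-insensitive)."""
    lowered = text.lower()
    return [
        p for p in phrases
        if re.search(r"\b" + re.escape(p) + r"\b", lowered)
    ]


@dataclass
class SignalReport:
    evaluation_framing: List[str]
    bypass_instructions: List[str]
    meta_safety: List[str]
    high_confidence: bool


def scan_prompt(text: str, requests_restricted_content: bool = False) -> SignalReport:
    """Scan one prompt for the listed signals.

    requests_restricted_content is a hypothetical input, assumed to be
    produced by a separate content classifier.
    """
    evaluation = _matches(EVALUATION_FRAMING, text)
    bypass = _matches(BYPASS_INSTRUCTIONS, text)
    meta = _matches(META_SAFETY, text)
    # Assumption: meta-safety phrases count as evaluation framing too.
    framed = bool(evaluation or meta)
    return SignalReport(
        evaluation_framing=evaluation,
        bypass_instructions=bypass,
        meta_safety=meta,
        # High confidence only when framing and restricted content co-occur.
        high_confidence=framed and requests_restricted_content,
    )
```

Whole-word matching misses inflected forms (for example, "evaluation" does not match "evaluate"), and a word such as "test" also appears in benign requests like "write a unit test". Keyword hits on their own are therefore weak evidence, which is why the sketch reserves `high_confidence` for the combined signal.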
Capability testing establishes an evaluation context that can persist across turns. If the model partially complies with a moderate-content capability test, subsequent tests can escalate through T3-AT-014 (Incremental Boundary Pushing) within the established evaluation frame.
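One way to account for a persistent evaluation frame is to track state per conversation rather than per prompt. The sketch below is illustrative only: `EvaluationFrameTracker`, the `complied` flag, and the integer `severity` scale are all hypothetical, and each is assumed to be supplied by upstream classifiers. It marks the frame as established once an evaluation-framed turn receives compliance, then flags any later turn whose severity exceeds the highest level already complied with inside that frame.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluationFrameTracker:
    """Tracks whether an evaluation frame persists across turns (hypothetical)."""

    frame_active: bool = False
    max_complied_severity: Optional[int] = None

    def observe(self, has_evaluation_framing: bool, complied: bool, severity: int) -> bool:
        """Record one turn; return True if it looks like escalation inside the frame.

        severity is an assumed integer scale (higher means more sensitive),
        and complied indicates full or partial compliance by the model.
        """
        # Escalation: the frame already exists and this turn asks for more
        # than anything the model has complied with so far.
        escalating = (
            self.frame_active
            and self.max_complied_severity is not None
            and severity > self.max_complied_severity
        )
        # Compliance with a framed request establishes the frame.
        if has_evaluation_framing and complied:
            self.frame_active = True
        # Within the frame, remember the highest severity complied with.
        if self.frame_active and complied:
            previous = self.max_complied_severity
            self.max_complied_severity = (
                severity if previous is None else max(previous, severity)
            )
        return escalating
```

A short usage example: the second turn carries no evaluation keywords, yet it is still flagged because the frame from the first turn persists.

```python
tracker = EvaluationFrameTracker()
tracker.observe(has_evaluation_framing=True, complied=True, severity=1)   # establishes the frame
flag = tracker.observe(has_evaluation_framing=False, complied=True, severity=2)
print(flag)
```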