T7-AT-013MEDIUM

Capability Probing

T7 · Output Manipulation & Exfiltration →

Risk score175

RatingMedium

Procedures10

Severity

Mechanism

Models are trained to be helpful when discussing their capabilities, creating tension between transparency (telling users what the model can do) and security (not revealing information enabling targeted attacks). Capability probing exploits this by asking the model to describe its restriction boundaries — treated as a helpfulness/transparency task. The model often provides detailed answers about what it won't do, inadvertently mapping the exact contour of its restrictions.

Detection

Flag queries referencing the model's safety system: "your limits," "your boundaries," "what you can't"
Detect systematic boundary-mapping: sequences progressively narrowing around a restriction
Observable signal: user asks about capabilities without pursuing any substantive task

Mitigation

Refuse boundary disclosureMEDIUM

Generic boundary descriptionsMEDIUM

Consistent capability claimsHIGH

Boundary obfuscationMEDIUM

Chaining

Capability probing is reconnaissance preceding T7-AT-010 (Differential Analysis), T7-AT-002 (Fragmentation), T1 (Prompt Subversion), and T2 (Semantic Evasion). Mapped boundaries inform payload design.

Framework mapping

OWASP LLMLLM07

MITRE ATLASAML.T0046

Open in the technique browser →