T3-AT-014 · MEDIUM

Incremental Boundary Pushing

T3 · Reasoning & Constraint Exploitation

Risk score: 195

Rating: Medium

Procedures: 5

Severity

Mechanism

The vulnerability is that this forces the model to *reveal its safety boundary parameters*, that is, what it considers the maximum permissible detail. An attacker uses that information to calibrate subsequent requests to sit just below the threshold, or to identify the specific feature the classifier uses to make refusal decisions. This is distinct from all other T3 techniques because it does not attempt to bypass the safety boundary; it maps it. It is a *reconnaissance technique* against the safety classifier itself, producing intelligence that makes all other T3 techniques more effective on subsequent attempts.

Detection

Explicit boundary probing markers: "as far as you can," "maximum detail," "without violating your policy," "push the limits," "up to the line," "what's the gray area"
These phrases signal reconnaissance intent: the user is mapping the safety classifier, not seeking information
Meta-policy questions: "what's your policy on..." combined with specific harmful content areas
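
The detection signals above can be sketched as a simple phrase-level pre-filter. This is a minimal illustration, not a production detector: the marker phrases are the ones quoted in the list, while `SENSITIVE_TOPICS`, `ProbeSignal`, and `detect_boundary_probing` are hypothetical names introduced here, and the topic list is a placeholder for whatever content classifier a deployment actually uses.

```python
import re
from dataclasses import dataclass
from typing import List

# Marker phrases quoted in the detection notes.
PROBE_MARKERS = [
    "as far as you can",
    "maximum detail",
    "without violating your policy",
    "push the limits",
    "up to the line",
    "what's the gray area",
]

# Meta-policy opener quoted in the detection notes.
META_POLICY = re.compile(r"what(?:'s| is) your policy on")

# ASSUMPTION: placeholder topics; a real system would call its own classifier.
SENSITIVE_TOPICS = ["explosives", "malware"]


@dataclass
class ProbeSignal:
    markers: List[str]   # marker phrases found in the prompt
    meta_policy: bool    # meta-policy question about a sensitive topic

    @property
    def flagged(self) -> bool:
        return bool(self.markers) or self.meta_policy


def normalize(text: str) -> str:
    # Lowercase and straighten curly apostrophes so "what’s" matches "what's".
    return text.lower().replace("\u2019", "'")


def detect_boundary_probing(prompt: str) -> ProbeSignal:
    text = normalize(prompt)
    markers = [m for m in PROBE_MARKERS if m in text]
    # A meta-policy question only counts when paired with a sensitive topic.
    meta = bool(META_POLICY.search(text)) and any(
        topic in text for topic in SENSITIVE_TOPICS
    )
    return ProbeSignal(markers=markers, meta_policy=meta)
```

Exact-phrase matching is easy to evade with paraphrase, so a filter like this is best treated as one weak signal among several rather than as a refusal trigger on its own.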

Mitigation

Boundary-probing detection: HIGH

Consistent refusal (no partial reveal): MEDIUM

Non-deterministic safety thresholds: LOW

Response-level information audit: MEDIUM
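
Two of the mitigations listed above, consistent refusal and a response-level information audit, could be combined roughly as follows. This is a sketch under stated assumptions: `UNIFORM_REFUSAL`, `BOUNDARY_LEAK_PATTERNS`, `leaks_boundary_info`, and `finalize_response` are hypothetical names, and the leak patterns are illustrative guesses at phrasing that describes the model's own limits, not a vetted list.

```python
import re

# ASSUMPTION: one fixed refusal string for every refused request, so the
# wording carries no hint about how close the request came to the threshold.
UNIFORM_REFUSAL = "I can't help with that request."

# ASSUMPTION: illustrative patterns for drafts that describe the model's limits.
BOUNDARY_LEAK_PATTERNS = [
    re.compile(r"\bthe most (?:detail )?i can (?:share|provide|give)\b", re.I),
    re.compile(r"\bi can go as far as\b", re.I),
    re.compile(r"\bif you (?:rephrase|remove|avoid)\b", re.I),
    re.compile(r"\bmy (?:policy|guidelines) (?:allow|permit|prohibit)s?\b", re.I),
]


def leaks_boundary_info(draft: str) -> bool:
    """Return True if the draft appears to describe where the boundary sits."""
    return any(p.search(draft) for p in BOUNDARY_LEAK_PATTERNS)


def finalize_response(draft: str, refused: bool) -> str:
    """Audit a draft before it is sent.

    Refused requests and drafts that leak boundary details both collapse
    to the same uniform refusal; everything else passes through unchanged.
    """
    if refused or leaks_boundary_info(draft):
        return UNIFORM_REFUSAL
    return draft
```

For example, `finalize_response("The most I can share is the general category.", refused=False)` returns the uniform refusal, because the draft states where the limit lies instead of simply answering or declining.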

Chaining

Boundary pushing is explicitly a *reconnaissance technique* that enables all subsequent attacks. Intelligence gathered feeds into T3-AT-001–019 by revealing exactly where the safety boundary sits, what keywords trigger refusal, and what framing the model considers acceptable.

Framework mapping

OWASP LLM: LLM01

MITRE ATLAS: AML.T0054
