T3-AT-014 · MEDIUM

Incremental Boundary Pushing

T3 · Reasoning & Constraint Exploitation
Risk score: 195
Rating: Medium
Procedures: 5
Mechanism

The vulnerability is that a request to go as far as policy allows forces the model to *reveal its safety boundary parameters*, that is, what it considers the maximum permissible detail. An attacker uses this to calibrate subsequent requests to sit just below the threshold, or to identify the specific feature the classifier uses to make refusal decisions. Unlike the other T3 techniques, this one does not attempt to bypass the safety boundary; it maps it. It is a *reconnaissance technique* against the safety classifier itself, producing intelligence that makes the other T3 techniques more effective on subsequent attempts.

Detection
  • Explicit boundary probing markers: "as far as you can," "maximum detail," "without violating your policy," "push the limits," "up to the line," "what's the gray area"
  • These phrases signal reconnaissance intent — the user is mapping the safety classifier, not seeking information
  • Meta-policy questions: "what's your policy on..." combined with specific harmful content areas
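
The markers listed above could be checked with a simple phrase matcher. The following is a minimal sketch, not a production detector: the names `PROBE_MARKERS`, `META_POLICY`, `ProbeResult`, and `detect_boundary_probe` are hypothetical, the phrase list is copied from the bullets above, and the regular expression for meta-policy questions is an assumption. It does not check whether a meta-policy question is paired with a harmful content area; that would need a separate topic classifier.

```python
import re
from dataclasses import dataclass

# Phrases copied from the detection list; a real deployment would likely
# need a broader, tuned set (assumption).
PROBE_MARKERS = [
    "as far as you can",
    "maximum detail",
    "without violating your policy",
    "push the limits",
    "up to the line",
    "what's the gray area",
]

# Hypothetical pattern for meta-policy questions ("what's your policy on ...").
META_POLICY = re.compile(r"\bwhat(?:'s| is) your policy on\b", re.IGNORECASE)


@dataclass
class ProbeResult:
    markers: list[str]   # probing phrases found in the prompt
    meta_policy: bool    # True if a meta-policy question was found

    @property
    def flagged(self) -> bool:
        return bool(self.markers) or self.meta_policy


def detect_boundary_probe(prompt: str) -> ProbeResult:
    # Lowercase and normalize curly apostrophes so "what’s" matches "what's".
    text = prompt.lower().replace("\u2019", "'")
    hits = [m for m in PROBE_MARKERS if m in text]
    return ProbeResult(markers=hits, meta_policy=bool(META_POLICY.search(text)))
```

A flagged result is only a signal of possible reconnaissance intent. Plain substring matching will miss paraphrases and will also flag some benign uses of these phrases, so it is best treated as one input to a wider decision.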
Mitigation
  • Boundary-probing detection: HIGH
  • Consistent refusal (no partial reveal): MEDIUM
  • Non-deterministic safety thresholds: LOW
  • Response-level information audit: MEDIUM
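
Two of the mitigations above, consistent refusal and a response-level information audit, can be illustrated together. This is a rough sketch under stated assumptions: `UNIFORM_REFUSAL`, `LEAK_PATTERNS`, `audit_response`, and `finalize_response` are hypothetical names, the leak patterns are illustrative guesses at wording that discloses where the boundary sits, and replacing a leaking response with the fixed refusal is one possible policy rather than a recommended one.

```python
import re

# Hypothetical fixed refusal text: identical for every refusal, so the
# wording carries no information about how close the request was to the line.
UNIFORM_REFUSAL = "I can't help with that request."

# Illustrative patterns (assumptions) for phrasing that reveals the boundary.
LEAK_PATTERNS = [
    re.compile(r"\bthe most (?:detail )?i can (?:say|share|provide)\b", re.I),
    re.compile(r"\bi can(?:'t|not) go (?:beyond|further than)\b", re.I),
    re.compile(r"\b(?:triggers?|crosses) (?:my|the) (?:policy|filter|threshold)\b", re.I),
]


def audit_response(response: str) -> list[str]:
    """Return the source of every leak pattern that matches the response."""
    return [p.pattern for p in LEAK_PATTERNS if p.search(response)]


def finalize_response(response: str, refused: bool) -> str:
    """Emit the uniform refusal for refusals and for boundary-leaking drafts."""
    if refused or audit_response(response):
        return UNIFORM_REFUSAL
    return response
```

For example, `finalize_response("That's the most I can say on this.", refused=False)` returns the uniform refusal, while an ordinary answer passes through unchanged.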
Chaining

Boundary pushing is explicitly a *reconnaissance technique* that enables all subsequent attacks. Intelligence gathered feeds into T3-AT-001–019 by revealing exactly where the safety boundary sits, what keywords trigger refusal, and what framing the model considers acceptable.

Framework mapping
OWASP LLM: LLM01
MITRE ATLAS: AML.T0054