T10-AT-004 · HIGH

Privacy Boundary Probing

T10 · Integrity & Confidentiality Breach
Risk score: 210
Rating: High
Procedures: 10
Mechanism

Privacy boundary probing exploits the inconsistency between a model's explicit refusal policy and its implicit knowledge encoding. Safety training teaches a model to refuse specific request patterns, but the refused knowledge remains encoded in the weights and can be reached through prompts that fall outside the patterns the refusal behavior covers. The vulnerability is the gap between the model's knowledge boundary (what it has encoded) and its disclosure boundary (what it will reveal). The two are shaped by different training objectives (pretraining vs. RLHF) and never fully align.

Detection
  • Track escalating disclosure requests within a single session: systematic boundary probing produces a characteristic query gradient from benign to sensitive
  • Monitor for meta-probing patterns: queries about what the model "knows," "remembers," or "can share" signal boundary mapping
  • Alert on organization-scoped queries combined with secrecy/confidentiality language
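The three detection signals above can be approximated with a per-session monitor. The following is a minimal sketch, not a production detector: the keyword list, the regular expressions, the `SessionMonitor` class, and the `window` and `min_rise` thresholds are all hypothetical choices made for illustration, and a real deployment would more likely rely on a trained intent or sensitivity classifier than on keyword matching.

```python
import re
from collections import defaultdict, deque

# Hypothetical term list and patterns, chosen only for illustration.
SENSITIVE_TERMS = ("email", "address", "phone", "salary", "medical", "password")
META_PROBE = re.compile(r"\b(do|can|did|will) you (know|remember|share|reveal)\b", re.I)
ORG_SCOPE = re.compile(r"\b(employees?|staff|internal|company|organi[sz]ation)\b", re.I)
SECRECY = re.compile(r"\b(confidential|secret|private|non-public|undisclosed)\b", re.I)


def sensitivity_score(query: str) -> int:
    """Count how many distinct sensitive terms appear in the query."""
    text = query.lower()
    return sum(term in text for term in SENSITIVE_TERMS)


class SessionMonitor:
    """Keeps a short per-session history of sensitivity scores."""

    def __init__(self, window: int = 4, min_rise: int = 2):
        self.window = window
        self.min_rise = min_rise
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, session_id: str, query: str) -> list[str]:
        """Record one query and return the alert labels it triggers."""
        alerts = []
        scores = self.history[session_id]
        scores.append(sensitivity_score(query))
        recent = list(scores)

        # Signal 1: a full window of non-decreasing scores that rises by at
        # least min_rise, i.e. a gradient from benign to sensitive.
        rising = all(a <= b for a, b in zip(recent, recent[1:]))
        if len(recent) == self.window and rising and recent[-1] - recent[0] >= self.min_rise:
            alerts.append("escalating_disclosure")

        # Signal 2: questions about what the model knows, remembers, or can share.
        if META_PROBE.search(query):
            alerts.append("meta_probing")

        # Signal 3: organization-scoped wording combined with secrecy language.
        if ORG_SCOPE.search(query) and SECRECY.search(query):
            alerts.append("org_secrecy")

        return alerts
```

Calling `observe(session_id, query)` once per user turn returns a possibly empty list of alert labels. Keyword heuristics of this kind are easy to evade through paraphrase, so the sketch is best read as a description of the signals rather than as a complete control.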
Mitigation
Consistent refusal policy across sensitivity levels · HIGH
Input intent classification · MEDIUM
Response uniformity for refusals · MEDIUM
Red-team boundary testing in development · HIGH
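One possible reading of "response uniformity for refusals" is that every refusal shown to the user is identical, whatever triggered it, so that differences in refusal wording cannot be used to map the disclosure boundary. The sketch below assumes that interpretation. The `RefusalReason` categories, the `UNIFORM_REFUSAL` text, and the `render_refusal` helper are hypothetical names introduced here, not part of any particular framework.

```python
from enum import Enum


class RefusalReason(Enum):
    # Hypothetical internal categories; a real policy engine may define others.
    PII = "pii"
    CONFIDENTIAL = "confidential"
    POLICY = "policy"


# A single user-facing message for every refusal, regardless of cause.
UNIFORM_REFUSAL = "I can't help with that request."


def render_refusal(reason: RefusalReason) -> tuple[str, dict]:
    """Return (user-facing text, internal log record).

    The specific reason goes only to the internal log, so the response
    itself carries no signal about which boundary was hit.
    """
    log_record = {"event": "refusal", "reason": reason.value}
    return UNIFORM_REFUSAL, log_record
```

Keeping the reason in the log record preserves the data needed for the detection signals, while the user-facing text stays constant across sensitivity levels.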
Chaining

Boundary mapping informs T10-AT-001 (Training Data Extraction) by identifying which categories of data the model will disclose, and guides T10-AT-002 (PII Extraction) by revealing which entity types have weaker privacy boundaries.

Framework mapping
OWASP LLM: LLM02; LLM07
MITRE ATLAS: AML.T0024