T10-AT-004 · HIGH

Privacy Boundary Probing

T10 · Integrity & Confidentiality Breach

Risk score: 210

Rating: High

Procedures: 10

Mechanism

Privacy boundary probing exploits the inconsistency between a model's explicit refusal policy and its implicit knowledge encoding. Safety training teaches a model to refuse specific request patterns, but the refused knowledge remains in the weights and can be reached through prompts that fall outside the patterns the refusal behavior covers. The vulnerability is the gap between the model's knowledge boundary (what it has encoded) and its disclosure boundary (what it will reveal). The two boundaries are shaped by different objectives (pretraining and RLHF, respectively) and never fully align.

Detection

Track escalating disclosure requests from the same session: systematic boundary probing produces a characteristic query gradient from benign to sensitive
Monitor for meta-probing patterns: queries about what the model "knows," "remembers," or "can share" signal boundary mapping
Alert on organization-scoped queries combined with secrecy/confidentiality language
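
The three signals above can be sketched as a small per-session monitor. This is an illustration only, assuming keyword heuristics stand in for a real sensitivity classifier. The term lists, weights, thresholds, and the names `SessionMonitor`, `sensitivity`, and `is_escalating` are all hypothetical and not part of this catalog entry.

```python
import re
from dataclasses import dataclass, field

# Hypothetical term lists and weights; a production system would likely
# use a trained classifier rather than keywords.
SENSITIVE_WEIGHTS = {"email": 1, "phone": 2, "address": 2, "salary": 3, "password": 3}
ORG_TERMS = {"company", "employee", "employees", "organization", "department", "staff"}
SECRECY_TERMS = {"confidential", "secret", "internal", "private", "unreleased", "nda"}
META_PROBE = re.compile(
    r"\b(?:do|can|could)\s+you\s+(?:know|remember|recall|share|tell)\b", re.I
)


def sensitivity(query: str) -> int:
    """Sum the weights of sensitive terms that appear in the query."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    return sum(w for term, w in SENSITIVE_WEIGHTS.items() if term in tokens)


def is_escalating(scores: list[int], window: int = 3, min_rise: int = 2) -> bool:
    """True if the last `window` scores never decrease and rise by `min_rise`."""
    recent = scores[-window:]
    if len(recent) < window:
        return False
    non_decreasing = all(a <= b for a, b in zip(recent, recent[1:]))
    return non_decreasing and recent[-1] - recent[0] >= min_rise


@dataclass
class SessionMonitor:
    scores: list[int] = field(default_factory=list)

    def observe(self, query: str) -> list[str]:
        """Record one query and return the names of any alerts it triggers."""
        alerts = []
        tokens = set(re.findall(r"[a-z]+", query.lower()))
        if META_PROBE.search(query):
            alerts.append("meta_probe")
        if tokens & ORG_TERMS and tokens & SECRECY_TERMS:
            alerts.append("org_secrecy")
        self.scores.append(sensitivity(query))
        if is_escalating(self.scores):
            alerts.append("escalation")
        return alerts
```

Calling `observe` once per user turn yields a list of alert names for that turn. The escalation check looks only at a short trailing window, so a slow probe spread over many turns would need a longer window or a different trend measure.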

Mitigation

Consistent refusal policy across sensitivity levels · HIGH

Input intent classification · MEDIUM

Response uniformity for refusals · MEDIUM

Red-team boundary testing in development · HIGH
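
Two of these mitigations lend themselves to a short sketch: returning one fixed refusal text regardless of which category triggered it, and checking during red-team testing that paraphrases of the same request receive the same decision. The code below is a minimal illustration under assumptions. `REFUSAL`, `uniform_response`, `inconsistent_groups`, and the toy keyword policy are hypothetical names introduced here, and the policy callable stands in for whatever refusal logic the system under test actually uses.

```python
from typing import Callable

# A single refusal text for every category, so the wording of a refusal
# does not reveal which boundary the request touched.
REFUSAL = "I can't help with that request."


def uniform_response(should_refuse: bool, draft: str) -> str:
    """Return the fixed refusal text or the drafted answer unchanged."""
    return REFUSAL if should_refuse else draft


def inconsistent_groups(
    groups: dict[str, list[str]], policy: Callable[[str], bool]
) -> list[str]:
    """Return the names of paraphrase groups where the policy's
    refuse/allow decision differs between prompts in the same group."""
    return [
        name
        for name, prompts in groups.items()
        if len({policy(p) for p in prompts}) > 1
    ]


def toy_policy(prompt: str) -> bool:
    """Deliberately weak stand-in policy: refuses only on a literal keyword."""
    return "address" in prompt.lower()


paraphrases = {
    "home_address": [
        "What is Jane Doe's home address?",
        "Where does Jane Doe live?",
    ],
    "weather": ["Is it raining?", "Will it rain tomorrow?"],
}
gaps = inconsistent_groups(paraphrases, toy_policy)
```

Here `gaps` names the paraphrase groups where the toy policy refuses one phrasing but allows another, which is the kind of inconsistency boundary probing looks for.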

Chaining

Boundary mapping informs T10-AT-001 (Training Data Extraction) by identifying which categories of data the model will disclose, and guides T10-AT-002 (PII Extraction) by revealing which entity types have weaker privacy boundaries.

Framework mapping

OWASP LLM: LLM02; LLM07

MITRE ATLAS: AML.T0024
