Privacy Boundary Probing
T10 · Integrity & Confidentiality Breach
Privacy boundary probing exploits the inconsistency between a model's explicit refusal policy and its implicit knowledge encoding. Safety training teaches a model to refuse specific request patterns, but the refused knowledge remains encoded in the weights and can be reached through prompts that fall outside the patterns the refusal behavior was trained to cover. The vulnerability is the gap between the model's knowledge boundary (what it has encoded) and its disclosure boundary (what it will reveal). The two are shaped by different training objectives (pretraining and RLHF, respectively) and never fully align.
- Track escalating disclosure requests within a single session: systematic boundary probing produces a characteristic query gradient from benign to sensitive
- Monitor for meta-probing patterns: queries about what the model "knows," "remembers," or "can share" signal boundary mapping
- Alert on organization-scoped queries combined with secrecy/confidentiality language
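The three detection signals above can be sketched as a small per-session monitor. This is a minimal illustration, not a production detector: the regular expressions, the `SENSITIVE_TERMS` weights, the `SessionMonitor` class, and the window and threshold values are all hypothetical placeholders chosen for the example. A real deployment would likely swap the keyword lexicon for a trained sensitivity classifier.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical patterns for illustration only; tune or replace for real traffic.
META_PROBE = re.compile(
    r"\b(?:do|can|could|will|would) you (?:know|remember|recall|share|reveal|disclose)\b",
    re.IGNORECASE,
)
ORG_SCOPE = re.compile(
    r"\b(?:company|employer|organi[sz]ation|employees?|staff)\b", re.IGNORECASE
)
SECRECY = re.compile(
    r"\b(?:confidential|secret|internal|private|proprietary|leaked)\b", re.IGNORECASE
)

# Assumed toy lexicon mapping terms to a sensitivity weight in [0, 1].
SENSITIVE_TERMS = {
    "email": 0.4, "phone": 0.5, "address": 0.6,
    "salary": 0.7, "password": 1.0, "ssn": 1.0,
}


def sensitivity(query: str) -> float:
    """Return the highest weight of any sensitive term in the query (0.0 if none)."""
    tokens = re.findall(r"[a-z]+", query.lower())
    return max((SENSITIVE_TERMS.get(t, 0.0) for t in tokens), default=0.0)


def slope(ys: List[float]) -> float:
    """Least-squares slope of ys against their indices 0..n-1."""
    n = len(ys)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(ys))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


@dataclass
class SessionMonitor:
    window: int = 5                # number of recent queries used for the gradient
    slope_threshold: float = 0.15  # assumed minimum rise in sensitivity per query
    min_final_score: float = 0.5   # latest query must itself be fairly sensitive
    scores: List[float] = field(default_factory=list)

    def observe(self, query: str) -> Set[str]:
        """Record one query and return the set of alerts it triggers."""
        alerts: Set[str] = set()
        if META_PROBE.search(query):
            alerts.add("meta_probe")
        if ORG_SCOPE.search(query) and SECRECY.search(query):
            alerts.add("org_secrecy")
        self.scores.append(sensitivity(query))
        recent = self.scores[-self.window:]
        if (
            len(recent) >= 3
            and slope(recent) >= self.slope_threshold
            and recent[-1] >= self.min_final_score
        ):
            alerts.add("escalation")
        return alerts
```

One `SessionMonitor` instance would be kept per session, with `observe()` called on each incoming query. The escalation check deliberately requires both a rising trend and a sensitive final query, so that a session that merely drifts between low-sensitivity topics does not trigger it.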
Boundary mapping informs T10-AT-001 (Training Data Extraction) by identifying which categories of data the model will disclose, and guides T10-AT-002 (PII Extraction) by revealing which entity types have weaker privacy boundaries.