T7-AT-005MEDIUM
Metadata Extraction
T7 · Output Manipulation & Exfiltration →Risk score185
RatingMedium
Procedures10
Severity
Mechanism
Models contain extensive self-knowledge about their safety configuration, training provenance, capability boundaries, and content filtering rules — embedded during training and reinforcement learning. The assumption violated is that the model can reliably refuse to disclose this self-knowledge when directly asked. System prompt extraction research demonstrates that a substantial fraction of production models disclose their system prompts under structured interaction.
Detection
- Flag queries containing meta-cognitive keywords: "your rules," "your training," "your filters," "your system prompt"
- Detect systematic probing patterns: multiple queries targeting different aspects of model configuration
- Observable signal: users who collect responses about model internals without pursuing any substantive task
Mitigation
Hard refusal on self-referential meta-queriesMEDIUM
System prompt isolationHIGH
Canary tokens in system promptsMEDIUM
Consistent refusal formattingMEDIUM
Chaining
Metadata extraction is the reconnaissance phase for all other T7 techniques. Disclosed filter rules enable precision T1 (Prompt Subversion).
Framework mapping
Open in the technique browser →OWASP LLMLLM07
MITRE ATLASAML.T0046