T7-AT-005MEDIUM

Metadata Extraction

T7 · Output Manipulation & Exfiltration →

Risk score185

RatingMedium

Procedures10

Severity

Mechanism

Models contain extensive self-knowledge about their safety configuration, training provenance, capability boundaries, and content filtering rules — embedded during training and reinforcement learning. The assumption violated is that the model can reliably refuse to disclose this self-knowledge when directly asked. System prompt extraction research demonstrates that a substantial fraction of production models disclose their system prompts under structured interaction.

Detection

Flag queries containing meta-cognitive keywords: "your rules," "your training," "your filters," "your system prompt"
Detect systematic probing patterns: multiple queries targeting different aspects of model configuration
Observable signal: users who collect responses about model internals without pursuing any substantive task

Mitigation

Hard refusal on self-referential meta-queriesMEDIUM

System prompt isolationHIGH

Canary tokens in system promptsMEDIUM

Consistent refusal formattingMEDIUM

Chaining

Metadata extraction is the reconnaissance phase for all other T7 techniques. Disclosed filter rules enable precision T1 (Prompt Subversion).

Framework mapping

OWASP LLMLLM07

MITRE ATLASAML.T0046

Open in the technique browser →