T7-AT-001 · MEDIUM

Reasoning Chain Disclosure

T7 · Output Manipulation & Exfiltration
Risk score: 190
Rating: Medium
Procedures: 10
Severity
Mechanism

Reasoning-enhanced models (DeepSeek-R1, QwQ, o1-style chains) maintain extended internal reasoning traces that contain raw decision logic, including safety evaluation heuristics, priority rankings between conflicting instructions, and unfiltered preliminary responses generated before safety post-processing. The technique breaks the architectural assumption that these traces are either fully hidden from the user or reliably sanitized before presentation. In practice, reasoning traces frequently surface system prompt fragments, PII from context, and the model's internal assessment of whether content is "allowed," which effectively hands attackers a blueprint of the decision-making pipeline.

Detection
  • Monitor for outputs containing meta-reasoning markers: "I considered," "my safety check," "I was about to say," "before filtering"
  • Flag requests containing phrases targeting internal process: "chain of thought," "decision tree," "unfiltered reasoning," "before safety"
  • For reasoning-enabled models: scan CoT traces for system prompt fragments, PII strings, or safety-rule quotations before surfacing them to the user
  • Observable signal: anomalously long responses to meta-cognitive questions, especially when the model "explains its reasoning about its reasoning"
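
The phrase-based checks and the system prompt fragment scan listed above can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the function names (`find_phrases`, `leaked_fragments`, `scan`) and the five-word overlap threshold are assumptions made for the example, and plain substring matching will produce both false positives and misses on paraphrased content.

```python
# Phrases quoted in the Detection list; a real deployment would tune these.
OUTPUT_MARKERS = ["i considered", "my safety check", "i was about to say", "before filtering"]
REQUEST_PHRASES = ["chain of thought", "decision tree", "unfiltered reasoning", "before safety"]


def find_phrases(text, phrases):
    """Return the phrases that occur in text, compared case-insensitively."""
    lowered = text.lower()
    return [p for p in phrases if p in lowered]


def leaked_fragments(trace, system_prompt, min_words=5):
    """Return runs of min_words consecutive system prompt words found verbatim in the trace.

    min_words is an assumed threshold: shorter runs match ordinary language too often.
    """
    words = system_prompt.split()
    normalized_trace = " ".join(trace.split()).lower()
    hits = []
    for i in range(len(words) - min_words + 1):
        fragment = " ".join(words[i:i + min_words]).lower()
        if fragment in normalized_trace:
            hits.append(fragment)
    return hits


def scan(request, output, trace, system_prompt):
    """Collect all three signals for one request/response pair."""
    return {
        "request_flags": find_phrases(request, REQUEST_PHRASES),
        "output_flags": find_phrases(output, OUTPUT_MARKERS),
        "prompt_leaks": leaked_fragments(trace, system_prompt),
    }
```

A caller would typically treat any non-empty `prompt_leaks` list as grounds to withhold the trace, and use the two phrase lists only as lower-confidence signals for review.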
Mitigation
CoT trace redaction before output · HIGH
PII scrubbing on reasoning traces · MEDIUM
Instruction hierarchy enforcement · MEDIUM
Output classifier on CoT content · HIGH
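
The first two mitigations, trace redaction and PII scrubbing, might look roughly like the sketch below. It assumes the model wraps its reasoning in `<think>...</think>` tags, as some open reasoning models do; other models expose traces through a separate API field, in which case the redaction step is simply not forwarding that field. The regular expressions and the function names (`redact_trace`, `scrub_pii`, `prepare_for_user`) are illustrative assumptions and cover only emails and phone-like digit runs.

```python
import re

# Assumption: reasoning is delimited by <think>...</think> in the raw model output.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# Deliberately simple patterns; real PII scrubbing needs a broader detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def redact_trace(raw_output):
    """Remove reasoning blocks, including one left unclosed by truncated generation."""
    visible = THINK_BLOCK.sub("", raw_output)
    open_idx = visible.lower().find("<think>")
    if open_idx != -1:
        # Fail closed: drop everything after an unterminated opening tag.
        visible = visible[:open_idx]
    return visible.strip()


def prepare_for_user(raw_output):
    """Redact the trace first, then scrub PII from what remains visible."""
    return scrub_pii(redact_trace(raw_output))
```

Dropping everything after an unclosed tag is a fail-closed choice: a truncated response loses some legitimate text, but a partial trace never reaches the user. If traces are retained for logging or review, `scrub_pii` can be applied to them separately.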
Chaining

Successful reasoning chain disclosure feeds directly into T7-AT-005 (Metadata Extraction) by revealing safety filter rules and priority hierarchies, and into T7-AT-013 (Capability Probing) by exposing the exact contour of restricted topics. Extracted refusal heuristics enable precision-crafted T1 (Prompt Subversion) payloads that satisfy the model's own stated conditions for compliance.

Framework mapping
OWASP LLM: LLM02; LLM07
MITRE ATLAS: AML.T0024