T7-AT-001 · MEDIUM

Reasoning Chain Disclosure

T7 · Output Manipulation & Exfiltration

Risk score: 190

Rating: Medium

Procedures: 10

Mechanism

Reasoning-enhanced models (DeepSeek-R1, QwQ, o1-style chains) maintain extended internal reasoning traces that contain raw decision logic, including safety evaluation heuristics, priority rankings between conflicting instructions, and unfiltered preliminary responses generated before safety post-processing. The technique violates the architectural assumption that these traces are either fully hidden from the user or reliably sanitized before presentation. In practice, reasoning traces frequently surface system prompt fragments, PII from context, and the model's internal assessment of whether content is "allowed," which effectively hands attackers a blueprint of the decision-making pipeline.

Detection

Monitor for outputs containing meta-reasoning markers: "I considered," "my safety check," "I was about to say," "before filtering"
Flag requests containing phrases targeting internal process: "chain of thought," "decision tree," "unfiltered reasoning," "before safety"
For reasoning-enabled models: scan CoT traces for system prompt fragments, PII strings, or safety-rule quotations before surfacing them to the user
Observable signal: anomalously long responses to meta-cognitive questions, especially when the model "explains its reasoning about its reasoning"
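
The checks above could be sketched roughly as follows. This is a minimal illustration in Python, not a production detector: the marker lists are copied from the guidance above, while the function names `find_markers` and `leaked_fragments`, the word n-gram approach, and the `window` size are assumptions made for the example. Plain substring matching will produce false positives and should be treated as a first-pass signal only.

```python
import re

# Marker phrases taken from the detection guidance above.
OUTPUT_MARKERS = ("i considered", "my safety check", "i was about to say", "before filtering")
REQUEST_MARKERS = ("chain of thought", "decision tree", "unfiltered reasoning", "before safety")


def find_markers(text, markers):
    """Return the markers that occur in text, compared case-insensitively."""
    lowered = text.lower()
    return [m for m in markers if m in lowered]


def _words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def leaked_fragments(trace, system_prompt, window=6):
    """Return word n-grams of the system prompt that appear verbatim in the trace.

    `window` (hypothetical default) is the n-gram length in words; shorter
    windows catch more leaks but also match more common phrases.
    """
    prompt_words = _words(system_prompt)
    # Pad with spaces so n-grams only match on whole-word boundaries.
    trace_text = " " + " ".join(_words(trace)) + " "
    hits = []
    for i in range(len(prompt_words) - window + 1):
        gram = " ".join(prompt_words[i:i + window])
        if f" {gram} " in trace_text:
            hits.append(gram)
    return hits
```

A caller might run `find_markers` on both the incoming request and the model output, and run `leaked_fragments` on the reasoning trace before it is shown. Any non-empty result would route the response to review or redaction.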

Mitigation

CoT trace redaction before output · HIGH

PII scrubbing on reasoning traces · MEDIUM

Instruction hierarchy enforcement · MEDIUM

Output classifier on CoT content · HIGH
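
As a rough sketch of the first two mitigations, the Python below strips reasoning blocks before output and scrubs PII from any trace that is intentionally shown. It assumes the trace is delimited by `<think>...</think>` tags, as in DeepSeek-R1-style output; other models may delimit traces differently or expose them through a separate field. The function names and the deliberately crude email and phone patterns are hypothetical and would need replacing with a proper PII detector in practice.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL | re.IGNORECASE)
# Crude illustrative PII patterns; not a complete PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def split_trace(raw):
    """Separate the reasoning blocks from the final answer."""
    traces = [m.group(1).strip() for m in THINK_BLOCK.finditer(raw)]
    answer = THINK_BLOCK.sub("", raw).strip()
    return traces, answer


def scrub_pii(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def sanitize_output(raw, show_trace=False):
    """Drop the trace by default; if it must be shown, scrub PII from it first."""
    traces, answer = split_trace(raw)
    # Fail closed on an unterminated block: drop everything from the tag onward.
    cut = answer.lower().find("<think>")
    if cut != -1:
        answer = answer[:cut].strip()
    if not show_trace:
        return answer
    shown = ["[reasoning] " + scrub_pii(t) for t in traces]
    return "\n".join(shown + [answer])
```

Dropping the trace by default reflects the higher rating given to redaction above; the `show_trace` path exists only for deployments that deliberately expose reasoning, and even there an output classifier on the trace content would still be needed to catch system prompt fragments and safety-rule quotations.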

Chaining

Successful reasoning chain disclosure feeds directly into T7-AT-005 (Metadata Extraction) by revealing safety filter rules and priority hierarchies, and into T7-AT-013 (Capability Probing) by exposing the exact contour of restricted topics. Extracted refusal heuristics enable precision-crafted T1 (Prompt Subversion) payloads that satisfy the model's own stated conditions for compliance.

Framework mapping

OWASP LLM: LLM02; LLM07

MITRE ATLAS: AML.T0024
