Reasoning Chain Disclosure
T7 · Output Manipulation & Exfiltration
Reasoning-enhanced models (DeepSeek-R1, QwQ, o1-style chains) maintain extended internal reasoning traces that contain raw decision logic: safety evaluation heuristics, priority rankings between conflicting instructions, and unfiltered preliminary responses generated before safety post-processing. The attack exploits a violated architectural assumption: that these traces are either fully hidden from the user or reliably sanitized before presentation. In practice, reasoning traces frequently surface system prompt fragments, PII from context, and the model's internal assessment of whether content is "allowed," which effectively gives attackers a blueprint of the decision-making pipeline.
- Monitor for outputs containing meta-reasoning markers: "I considered," "my safety check," "I was about to say," "before filtering"
- Flag requests containing phrases that target the model's internal process: "chain of thought," "decision tree," "unfiltered reasoning," "before safety"
- For reasoning-enabled models: scan CoT traces for system prompt fragments, PII strings, or safety-rule quotations before surfacing them to the user
- Observable signal: anomalously long responses to meta-cognitive questions, especially when the model "explains its reasoning about its reasoning"
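The checks above can be sketched as a simple pre-display filter. This is a minimal illustration, not a production detector: the function names (`find_phrases`, `find_prompt_fragments`, `scan_trace`, `should_withhold`), the six-word overlap window, and the two PII patterns are all assumptions made for the example. Only the marker and phrase lists come from the bullets above. Plain substring matching like this is easy to evade with paraphrase, so it is best treated as a first-pass signal.

```python
import re

# Marker and phrase lists are taken from the detection bullets above.
OUTPUT_MARKERS = ("i considered", "my safety check", "i was about to say", "before filtering")
REQUEST_PHRASES = ("chain of thought", "decision tree", "unfiltered reasoning", "before safety")

# Illustrative PII patterns only (assumption): email addresses and US-style SSNs.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_phrases(text, phrases):
    """Return the phrases that occur in text, case-insensitively."""
    lowered = text.lower()
    return [p for p in phrases if p in lowered]


def find_prompt_fragments(trace, system_prompt, window=6):
    """Return word n-grams (n = window) of the system prompt found verbatim in the trace."""
    normalized_trace = " ".join(trace.lower().split())
    words = system_prompt.lower().split()
    hits = []
    for i in range(len(words) - window + 1):
        fragment = " ".join(words[i:i + window])
        if fragment in normalized_trace:
            hits.append(fragment)
    return hits


def scan_trace(trace, system_prompt):
    """Collect findings for a reasoning trace before it is shown to the user."""
    return {
        "markers": find_phrases(trace, OUTPUT_MARKERS),
        "prompt_fragments": find_prompt_fragments(trace, system_prompt),
        "pii": [name for name, pat in PII_PATTERNS.items() if pat.search(trace)],
    }


def should_withhold(findings):
    """Withhold the trace if any check produced a finding."""
    return any(findings.values())
```

The same `find_phrases` helper covers the request-side check when called with `REQUEST_PHRASES` on the incoming prompt. A deployment would likely log the findings rather than only dropping the trace, since repeated hits from one session are themselves a probing signal.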
Successful reasoning chain disclosure feeds directly into T7-AT-005 (Metadata Extraction) by revealing safety filter rules and priority hierarchies, and into T7-AT-013 (Capability Probing) by exposing the exact contour of restricted topics. Extracted refusal heuristics enable precision-crafted T1 (Prompt Subversion) payloads that satisfy the model's own stated conditions for compliance.