T7-AT-003 · MEDIUM

Output Format Exploitation

T7 · Output Manipulation & Exfiltration

Risk score: 175

Rating: Medium

Procedures: 10

Mechanism

Safety classifiers are trained primarily on natural-language prose and evaluate the semantic content of human-readable text. Structured output formats (JSON, XML, HTML, CSV, code) create syntactic containers, such as comments, attributes, non-display fields, and deeply nested keys, where content is present in the output but is not evaluated by classifiers optimized for prose. Research from March 2025 demonstrated a more fundamental variant: when structured output APIs enforce grammar-guided decoding, the malicious intent can be placed entirely in the output schema (the control plane) while the input prompt (the data plane) remains benign.
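To make the "syntactic container" idea concrete, the following minimal sketch separates the text a reader would see in an HTML fragment from the text carried in comments and attribute values. It uses only Python's standard `html.parser`. The class name `ChannelSplitter`, the helper `split_channels`, and the sample fragment are illustrative assumptions, not part of the technique definition.

```python
from html.parser import HTMLParser


class ChannelSplitter(HTMLParser):
    """Separates rendered text from non-display channels (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.visible = []  # text nodes a reader would normally see
        self.hidden = []   # comments and attribute values

    def handle_data(self, data):
        if data.strip():
            self.visible.append(data.strip())

    def handle_comment(self, data):
        self.hidden.append(data.strip())

    def handle_starttag(self, tag, attrs):
        # Attribute values are part of the output but are rarely rendered.
        for _name, value in attrs:
            if value:
                self.hidden.append(value)


def split_channels(html):
    """Return (visible_strings, hidden_strings) for one HTML fragment."""
    parser = ChannelSplitter()
    parser.feed(html)
    parser.close()
    return parser.visible, parser.hidden


# Benign sample: one visible string, two non-display strings.
sample = '<p data-note="second channel">Hello</p><!-- third channel -->'
visible, hidden = split_channels(sample)
```

A classifier that only receives `visible` never evaluates the strings collected in `hidden`, which is the gap the mechanism describes. This sketch treats script and style bodies as visible text, so a real pipeline would likely need to handle those separately.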

Detection

Apply safety classification to all output channels: JSON values, HTML comments, XML attributes, code comments — not just visible prose
Scan for Unicode control characters (zero-width joiners, BiDi overrides) in all output
Detect base64-encoded strings in non-binary output contexts
Observable signal: output significantly longer than the visible/rendered content, indicating hidden data in non-display fields
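The checks above can be sketched as a single pass over a JSON output. This is a rough illustration under stated assumptions rather than a reference detector: the function names, the `classify` callback (a stand-in for whatever safety classifier a deployment uses), the 24-character base64 threshold, and `LENGTH_RATIO_LIMIT` are all hypothetical choices.

```python
import base64
import binascii
import json
import re
import unicodedata

# Assumption: runs of 24+ base64-alphabet characters are worth a closer look.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")
# Assumption: raw output more than 5x the rendered text is treated as a signal.
LENGTH_RATIO_LIMIT = 5.0


def iter_strings(node, path="$"):
    """Yield (path, text) for every key and string value in parsed JSON."""
    if isinstance(node, str):
        yield path, node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield f"{path}.{key}#key", key
            yield from iter_strings(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from iter_strings(value, f"{path}[{i}]")


def has_format_chars(text):
    """True if text contains Unicode format (Cf) characters, which include
    zero-width joiners and BiDi overrides. Legitimate emoji sequences also
    use zero-width joiners, so this is a signal, not a verdict."""
    return any(unicodedata.category(ch) == "Cf" for ch in text)


def looks_like_base64(text):
    """True if text contains a long run that decodes as strict base64."""
    for match in BASE64_RUN.finditer(text):
        candidate = match.group(0)
        if len(candidate) % 4:
            continue
        try:
            base64.b64decode(candidate, validate=True)
        except (binascii.Error, ValueError):
            continue
        return True
    return False


def scan_output(raw_json, visible_text=None, classify=None):
    """Return (path, finding) tuples for one structured output."""
    findings = []
    for path, text in iter_strings(json.loads(raw_json)):
        if classify is not None and classify(text):
            findings.append((path, "classifier"))
        if has_format_chars(text):
            findings.append((path, "unicode-control"))
        if looks_like_base64(text):
            findings.append((path, "base64"))
    if visible_text is not None:
        if len(raw_json) > LENGTH_RATIO_LIMIT * max(len(visible_text), 1):
            findings.append(("$", "length-mismatch"))
    return findings
```

Every key and value is handed to the same checks regardless of nesting depth, so a deeply nested or non-display field is treated no differently from the top-level text. The thresholds would need tuning against real traffic, since long identifiers and hashes can resemble base64.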

Mitigation

Unified output classification: HIGH

Unicode normalization on output: HIGH

Schema validation on structured output: MEDIUM

CDA-aware grammar filtering: MEDIUM
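A minimal sketch of how three of these mitigations might look in code, assuming a response type with a small set of string fields. `ALLOWED_FIELDS`, `normalize_text`, `validate_and_clean`, and `schema_strings` are hypothetical names introduced for illustration. The last function reflects only one possible reading of schema-side filtering: because the mechanism places intent in the output schema, the schema's own keys and literals are extracted so they can be passed through the same classifier as the output.

```python
import unicodedata

# Hypothetical allowlist for one response type: field name -> max length.
ALLOWED_FIELDS = {"title": 120, "summary": 2000}


def normalize_text(text):
    """NFKC-normalize, then drop Unicode format (Cf) characters such as
    zero-width joiners and BiDi overrides. Assumption: losing emoji joiner
    sequences is acceptable for this output type."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")


def validate_and_clean(obj):
    """Reject unknown fields and non-string values; return normalized copy."""
    if not isinstance(obj, dict):
        raise ValueError("top-level value must be an object")
    unknown = set(obj) - set(ALLOWED_FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    cleaned = {}
    for name, value in obj.items():
        if not isinstance(value, str):
            raise ValueError(f"field {name!r} must be a string")
        value = normalize_text(value)
        if len(value) > ALLOWED_FIELDS[name]:
            raise ValueError(f"field {name!r} exceeds its length limit")
        cleaned[name] = value
    return cleaned


def schema_strings(schema):
    """Yield every key and string literal in a caller-supplied output schema,
    so the control plane can be classified like any other text."""
    if isinstance(schema, str):
        yield schema
    elif isinstance(schema, dict):
        for key, value in schema.items():
            yield key
            yield from schema_strings(value)
    elif isinstance(schema, list):
        for item in schema:
            yield from schema_strings(item)
```

Rejecting unknown fields outright, rather than silently dropping them, keeps an unexpected non-display field visible to operators instead of hiding it.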

Chaining

Output format exploitation directly enables T7-AT-014 (Output Redirection) when structured outputs are consumed by downstream systems that parse but don't safety-check the content. In agentic contexts, poisoned structured output feeds T11 (Agentic Exploitation) when agent tool inputs are constructed from model-generated JSON/XML.
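One way to break this chain is to put a gate between model-generated JSON and the tool call it would trigger. The sketch below is an assumption-laden illustration: the `TOOL_ARGS` registry, the `{"tool": ..., "arguments": ...}` message shape, and the `is_safe` callback (a stand-in for a deployment's safety check) are all hypothetical.

```python
import json

# Hypothetical tool registry: tool name -> permitted argument names.
TOOL_ARGS = {
    "search": {"query"},
    "send_email": {"to", "subject", "body"},
}


def build_tool_call(raw_json, is_safe):
    """Parse model output and return (tool, arguments) only if every part
    passes the allowlist and the caller-supplied safety check."""
    call = json.loads(raw_json)
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    tool = call.get("tool")
    args = call.get("arguments", {})
    if not isinstance(tool, str) or tool not in TOOL_ARGS:
        raise ValueError(f"unknown tool: {tool!r}")
    if not isinstance(args, dict) or set(args) - TOOL_ARGS[tool]:
        raise ValueError("unexpected arguments")
    for name, value in args.items():
        if not isinstance(value, str):
            raise ValueError(f"argument {name!r} must be a string")
        # Safety-check the content, not just the structure.
        if not is_safe(value):
            raise ValueError(f"argument {name!r} failed the safety check")
    return tool, args
```

The point of the design is that parsing and safety checking happen in the same place, so a downstream system cannot consume a well-formed but unchecked value.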

Framework mapping

OWASP LLM: LLM05

MITRE ATLAS: AML.T0048.004
