Output Format Exploitation
T7 · Output Manipulation & Exfiltration
Safety classifiers are trained primarily on natural-language prose and evaluate the semantic content of human-readable text. Structured output formats (JSON, XML, HTML, CSV, code) create syntactic containers, such as comments, attributes, non-display fields, and deeply nested keys, where content is present in the output but is not evaluated by classifiers optimized for prose. Work published in March 2025 demonstrated a more fundamental variant: when structured output APIs enforce grammar-guided decoding, the malicious intent can be placed entirely in the output schema (the control plane) while the input prompt (the data plane) remains benign.
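Because the schema itself can carry the payload, one possible countermeasure is to treat every string inside the schema as text to be classified alongside the prompt. The following is a minimal sketch under stated assumptions, not the method from the work cited above: `schema_strings`, `request_is_flagged`, and the `classify` callable are hypothetical names, and `classify` stands in for whatever safety classifier is already deployed.

```python
def schema_strings(schema, path="$"):
    """Yield (path, text) for every key and string value in a JSON Schema.

    The walk is deliberately generic: it also yields harmless keywords such
    as "type", which over-collects but avoids missing an unusual location.
    """
    if isinstance(schema, str):
        yield path, schema
    elif isinstance(schema, dict):
        for key, value in schema.items():
            child = f"{path}.{key}"
            # Property names can be forced into the output by constrained
            # decoding, so keys are reported as text too.
            yield child, key
            yield from schema_strings(value, child)
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            yield from schema_strings(item, f"{path}[{index}]")


def request_is_flagged(prompt, schema, classify):
    """Return True if the prompt or any schema string is flagged.

    `classify` is a hypothetical callable (str -> bool) supplied by the caller.
    """
    if classify(prompt):
        return True
    return any(classify(text) for _, text in schema_strings(schema))
```

A caller would run `request_is_flagged` before the request reaches the structured-output API, so that a benign prompt paired with a hostile `enum` or `description` value is still reviewed.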
- Apply safety classification to all output channels: JSON values, HTML comments, XML attributes, code comments — not just visible prose
- Scan for Unicode control characters (zero-width joiners, BiDi overrides) in all output
- Detect base64-encoded strings in non-binary output contexts
- Observable signal: output significantly longer than the visible/rendered content, indicating hidden data in non-display fields
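The checks listed above can be combined into a single pass over each model output. The sketch below is illustrative only and uses the Python standard library: `scan_output`, `ChannelSplitter`, the `classify` callable, the 24-character base64 threshold, and the default length ratio of 3.0 are all assumptions chosen for the example, not values from the source. The base64 check will also flag long alphanumeric tokens that happen to decode cleanly, so its findings are candidates for review rather than verdicts.

```python
import base64
import binascii
import json
import re
import unicodedata
from html.parser import HTMLParser

# Runs of 24+ base64-alphabet characters; the threshold is an assumption.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


class ChannelSplitter(HTMLParser):
    """Separate rendered text from comments and attribute values."""

    def __init__(self):
        super().__init__()
        self.visible, self.hidden = [], []

    def handle_data(self, data):
        self.visible.append(data)

    def handle_comment(self, data):
        self.hidden.append(data)

    def handle_starttag(self, tag, attrs):
        self.hidden.extend(value for _, value in attrs if value)


def json_strings(node):
    """Yield every key and string value in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield key
            yield from json_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from json_strings(item)


def scan_output(raw, classify=None, max_length_ratio=3.0):
    """Return sorted (signal, detail) findings for one model output string."""
    splitter = ChannelSplitter()
    splitter.feed(raw)
    splitter.close()
    channels = [raw, *splitter.hidden]
    try:
        channels.extend(json_strings(json.loads(raw)))
    except json.JSONDecodeError:
        pass  # not a JSON document; the markup channels still apply

    findings = set()
    for text in channels:
        # `classify` is a hypothetical callable (str -> bool).
        if classify is not None and classify(text):
            findings.add(("classifier", text))
        # Unicode category Cf covers zero-width joiners and BiDi overrides.
        for ch in text:
            if unicodedata.category(ch) == "Cf":
                findings.add(("unicode_control", f"U+{ord(ch):04X}"))
        for match in BASE64_RUN.finditer(text):
            try:
                base64.b64decode(match.group(), validate=True)
            except (binascii.Error, ValueError):
                continue
            findings.add(("base64", match.group()))

    # Compare total output length with the text a renderer would display.
    visible_len = sum(len(part.strip()) for part in splitter.visible)
    ratio = len(raw) / max(visible_len, 1)
    if ratio > max_length_ratio:
        findings.add(("length_ratio", f"{ratio:.1f}"))
    return sorted(findings)
```

Decoded JSON strings are scanned in addition to the raw text because a serializer may escape control characters (for example as `\u200d`), which hides them from a scan of the raw output alone.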
Output format exploitation directly enables T7-AT-014 (Output Redirection) when structured outputs are consumed by downstream systems that parse but don't safety-check the content. In agentic contexts, poisoned structured output feeds T11 (Agentic Exploitation) when agent tool inputs are constructed from model-generated JSON/XML.