T7-AT-003 · MEDIUM

Output Format Exploitation

T7 · Output Manipulation & Exfiltration
Risk score: 175
Rating: Medium
Procedures: 10
Mechanism

Safety classifiers are trained primarily on natural-language prose and evaluate the semantic content of human-readable text. Structured output formats (JSON, XML, HTML, CSV, code) create syntactic containers, such as comments, attributes, non-display fields, and deeply nested keys, where content is present in the output but is not evaluated by classifiers optimized for prose. Research published in March 2025 demonstrated a more fundamental variant: when structured output APIs enforce grammar-guided decoding, the malicious intent can be placed entirely in the output schema (the control plane) while the input prompt (the data plane) remains benign.
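
To make the "syntactic container" idea concrete, the following is a minimal sketch of pulling text out of channels that a prose-only classifier would typically never see: JSON keys and nested values, HTML comments, and HTML attribute values. The helper names (`json_strings`, `hidden_html_strings`) are illustrative assumptions, not part of any framework referenced in this entry, and only Python standard-library modules are used.

```python
import json
from html.parser import HTMLParser


def json_strings(node):
    """Yield every string in a parsed JSON document, including object keys."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield key
            yield from json_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from json_strings(item)


class HiddenHTMLText(HTMLParser):
    """Collect comment bodies and attribute values, which browsers do not render as text."""

    def __init__(self):
        super().__init__()
        self.hidden = []

    def handle_comment(self, data):
        self.hidden.append(data.strip())

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for _name, value in attrs:
            if value:
                self.hidden.append(value)


def hidden_html_strings(markup):
    """Return non-displayed strings found in an HTML fragment, in document order."""
    parser = HiddenHTMLText()
    parser.feed(markup)
    parser.close()
    return parser.hidden


if __name__ == "__main__":
    doc = json.loads('{"summary": "ok", "meta": {"debug": ["nested text"]}}')
    print(list(json_strings(doc)))
    print(hidden_html_strings('<p title="tooltip text">Hello<!-- comment text --></p>'))
```

Each string recovered this way is a candidate input for whatever classifier already screens the visible prose; the sketch only covers extraction, not classification.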

Detection
  • Apply safety classification to all output channels: JSON values, HTML comments, XML attributes, code comments — not just visible prose
  • Scan for Unicode control characters (zero-width joiners, BiDi overrides) in all output
  • Detect base64-encoded strings in non-binary output contexts
  • Observable signal: output significantly longer than the visible/rendered content, indicating hidden data in non-display fields
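
The checks listed above might be sketched as follows. This is a rough illustration under stated assumptions rather than a production detector: the 24-character minimum for base64 runs is an arbitrary threshold, the function names are hypothetical, and `hidden_ratio` is only one possible way to express the "output much longer than rendered content" signal. Format characters such as the zero-width joiner also occur legitimately (for example in emoji sequences), so hits are signals to review, not proof of abuse.

```python
import base64
import binascii
import re
import unicodedata

# Assumed threshold: runs of 24+ base64-alphabet characters, optionally padded.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def find_format_chars(text):
    """Return (index, name) for each Unicode format character (general category Cf).

    Category Cf covers zero-width joiners and BiDi override/isolate controls.
    """
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def find_base64(text):
    """Return (candidate, decoded_bytes) for substrings that decode as strict base64.

    Long hex strings or identifiers can also decode successfully, so callers
    should expect false positives.
    """
    hits = []
    for match in BASE64_RUN.finditer(text):
        candidate = match.group(0)
        if len(candidate) % 4 != 0:
            continue
        try:
            decoded = base64.b64decode(candidate, validate=True)
        except (binascii.Error, ValueError):
            continue
        hits.append((candidate, decoded))
    return hits


def hidden_ratio(raw_output, rendered_text):
    """Fraction of the raw output length not accounted for by the rendered text."""
    if not raw_output:
        return 0.0
    return max(0.0, 1 - len(rendered_text) / len(raw_output))


if __name__ == "__main__":
    payload = base64.b64encode(b"exfiltrate this secret value").decode()
    sample = '{"id": 1, "trace": "' + payload + '"}'
    print(find_base64(sample))
    print(find_format_chars("pay\u200dload\u202e"))
    print(hidden_ratio("<p>Hi<!-- a long hidden note --></p>", "Hi"))
```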
Mitigation
Unified output classification: HIGH
Unicode normalization on output: HIGH
Schema validation on structured output: MEDIUM
CDA-aware grammar filtering: MEDIUM
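
Two of these mitigations lend themselves to a short sketch. The first function folds compatibility characters with NFKC and strips format characters from output. The second walks a caller-supplied JSON Schema and collects the literal strings (for example `enum` and `const` values and property names) that grammar-guided decoding could force into the output, so they can be screened before the schema is accepted. Treating "grammar filtering" as schema-literal screening is an interpretation made here, and `schema_literals`, `screen_schema`, and the `is_unsafe` callback are hypothetical names; note also that stripping all format characters will break legitimate uses such as emoji joiner sequences.

```python
import unicodedata

# Assumption: these keywords carry literal or free text worth screening.
LITERAL_KEYWORDS = ("enum", "const", "default", "pattern", "description", "title")


def normalize_output(text):
    """Apply NFKC folding, then drop Unicode format characters (category Cf)."""
    folded = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")


def _strings(value):
    """Yield all strings (including dict keys) inside an arbitrary JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from _strings(item)
    elif isinstance(value, dict):
        for key, item in value.items():
            yield key
            yield from _strings(item)


def schema_literals(node):
    """Yield property names and literal-keyword strings from a JSON Schema dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in LITERAL_KEYWORDS:
                yield from _strings(value)
            elif key == "properties" and isinstance(value, dict):
                for prop_name, subschema in value.items():
                    yield prop_name
                    yield from schema_literals(subschema)
            else:
                yield from schema_literals(value)
    elif isinstance(node, list):
        for item in node:
            yield from schema_literals(item)


def screen_schema(schema, is_unsafe):
    """Return the schema literals flagged by is_unsafe (a hypothetical classifier callback)."""
    return [text for text in schema_literals(schema) if is_unsafe(text)]
```

A caller would run `screen_schema` on the requested schema before starting constrained decoding and reject the request if anything is flagged, in addition to classifying the prompt itself.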
Chaining

Output format exploitation directly enables T7-AT-014 (Output Redirection) when structured outputs are consumed by downstream systems that parse but don't safety-check the content. In agentic contexts, poisoned structured output feeds T11 (Agentic Exploitation) when agent tool inputs are constructed from model-generated JSON/XML.
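
One way to interrupt this chain is to validate and safety-check model-generated JSON before it becomes a tool input. The sketch below assumes a hypothetical tool whose arguments are `path` (string) and `max_bytes` (integer), and a hypothetical `is_unsafe` callback standing in for whatever classifier is already applied to visible prose; neither comes from the source.

```python
import json

# Hypothetical tool contract: allowed argument names and their expected types.
ALLOWED_ARGS = {"path": str, "max_bytes": int}


class RejectedToolInput(ValueError):
    """Raised when model-generated JSON must not be passed to the tool."""


def build_tool_args(model_output, is_unsafe):
    """Parse, validate, and safety-check model JSON before it reaches a tool."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise RejectedToolInput(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise RejectedToolInput("expected a JSON object")

    unknown = set(data) - set(ALLOWED_ARGS)
    if unknown:
        raise RejectedToolInput(f"unexpected keys: {sorted(unknown)}")

    for name, value in data.items():
        expected = ALLOWED_ARGS[name]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            raise RejectedToolInput(f"{name}: expected {expected.__name__}")
        # Apply the same safety check to string values that visible prose receives.
        if isinstance(value, str) and is_unsafe(value):
            raise RejectedToolInput(f"{name}: flagged by safety check")
    return data
```

Rejecting unknown keys matters here because extra fields are exactly the kind of non-display container described under Mechanism.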

Framework mapping
OWASP LLM: LLM05
MITRE ATLAS: AML.T0048.004