T7-AT-006MEDIUM

Steganographic Output

T7 · Output Manipulation & Exfiltration →
Risk score170
RatingMedium
Procedures10
Severity
Mechanism

Models can be instructed — or fine-tuned — to embed covert information channels within seemingly innocent output by encoding data in linguistic features: first-letter acrostics, sentence length patterns, word choice from constrained vocabularies, capitalization sequences, or punctuation patterns. Safety classifiers evaluate surface semantics, not statistical properties of these channels. TrojanStego (May 2025) demonstrated that fine-tuned models embed 32-bit secrets at 87% accuracy (97%+ with majority voting) while maintaining output quality and evading human detection.

Detection
  • Statistical analysis of output linguistic features vs. baseline model distributions
  • For fine-tuned models: compare token probability distributions against the base model on identical prompts
  • Detect CoT reasoning explicitly mentioning encoding or steganographic techniques
  • Observable signal: outputs statistically unusual in linguistic features but semantically normal
Mitigation
Output paraphrasingHIGH
Statistical anomaly detectionMEDIUM
CoT monitoring for encoding intentHIGH
Fine-tuning provenance verificationHIGH
Chaining

Steganographic channels enable covert exfiltration past output monitors, feeding T7-AT-014 (Output Redirection). In multi-agent systems, steganographic collusion creates undetectable coordination channels (ASI10 — Rogue Agents).

Framework mapping
OWASP LLMLLM05
MITRE ATLASAML.T0048
Open in the technique browser →