Steganographic Output
T7 · Output Manipulation & Exfiltration
Models can be instructed, or fine-tuned, to embed covert information channels in seemingly innocent output by encoding data in linguistic features: first-letter acrostics, sentence-length patterns, word choice from constrained vocabularies, capitalization sequences, or punctuation patterns. Safety classifiers evaluate the surface semantics of the output, not the statistical properties that carry these channels, so the encoded data passes unnoticed. TrojanStego (May 2025) demonstrated that fine-tuned models can embed 32-bit secrets at 87% accuracy (97%+ with majority voting) while maintaining output quality and evading human detection.
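To make the idea of a linguistic-feature channel concrete, the following is a minimal sketch of how a reviewer might read two of the channels named above (first-letter acrostics and capitalization sequences) back out of a piece of text. The function names and the sample sentence are hypothetical illustrations, not part of TrojanStego or any specific tool, and the sentence splitting is deliberately naive.

```python
import re


def first_letter_acrostic(text: str) -> str:
    """Return the first alphabetic character of each sentence, uppercased.

    Sentences are split naively on '.', '!' or '?' followed by whitespace.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    letters = []
    for sentence in sentences:
        match = re.search(r"[A-Za-z]", sentence)
        if match:
            letters.append(match.group(0).upper())
    return "".join(letters)


def capitalization_bits(text: str) -> str:
    """Read one bit per word: '1' if the word starts uppercase, else '0'."""
    words = re.findall(r"[A-Za-z]+", text)
    return "".join("1" if word[0].isupper() else "0" for word in words)


# Hypothetical sample: the text reads as harmless cooking advice.
sample = "Keep the lid closed. Every batch needs ten minutes. You can add salt later."
print(first_letter_acrostic(sample))  # KEY
```

The sample is semantically unremarkable, which is why a classifier that only scores meaning would have no reason to flag it.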
- Statistical analysis of output linguistic features vs. baseline model distributions
- For fine-tuned models: compare token probability distributions against the base model on identical prompts
- Flag chain-of-thought (CoT) reasoning that explicitly mentions encoding or steganographic techniques
- Observable signal: outputs statistically unusual in linguistic features but semantically normal
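The checks in this list can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not a production detector: token distributions are passed in as plain dictionaries (in practice they would come from the base and fine-tuned models on the same prompt), mean sentence length stands in for the wider set of linguistic features, and the keyword pattern for CoT scanning is a hypothetical starting point. All function names are invented for this example, and any alerting threshold would need to be calibrated on known-clean output.

```python
import math
import re
import statistics


def kl_divergence(p: dict, q: dict, eps: float = 1e-12) -> float:
    """KL(p || q) in nats; tokens missing from q are floored at eps."""
    total = 0.0
    for token, p_t in p.items():
        if p_t > 0.0:
            total += p_t * math.log(p_t / max(q.get(token, 0.0), eps))
    return total


def mean_sentence_length(text: str) -> float:
    """Average number of whitespace-separated words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return statistics.fmean([len(s.split()) for s in sentences])


def z_score(value: float, baseline: list) -> float:
    """Standardize a feature value against baseline samples (needs >= 2)."""
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    return (value - mean) / stdev if stdev > 0 else 0.0


# Hypothetical keyword list for scanning CoT traces; extend as needed.
COT_PATTERN = re.compile(
    r"steganograph|acrostic|hidden message|encode the (?:secret|key|payload)",
    re.IGNORECASE,
)


def mentions_encoding(cot_trace: str) -> bool:
    """True if the reasoning trace explicitly mentions an encoding technique."""
    return bool(COT_PATTERN.search(cot_trace))


# Made-up next-token distributions for one prompt (assumed values).
base = {"the": 0.5, "a": 0.3, "this": 0.2}
tuned = {"the": 0.1, "a": 0.1, "this": 0.8}
print(kl_divergence(tuned, base))  # larger values mean a bigger shift from base
```

A large divergence or z-score on output that still reads normally matches the observable signal described above. Neither number proves a covert channel on its own, since benign fine-tuning also shifts distributions.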
Steganographic channels enable covert exfiltration past output monitors, feeding T7-AT-014 (Output Redirection). In multi-agent systems, steganographic collusion creates coordination channels that are hard to detect (ASI10: Rogue Agents).