T9-AT-002 · HIGH

Audio Instruction Embedding

T9 · Multimodal & Cross-Channel Attacks
Risk score: 235
Rating: High
Procedures: 10
Mechanism

Audio LLMs process speech either through an automatic speech recognition (ASR) stage that produces a transcript or through a direct audio encoder that converts the signal into token embeddings. The vulnerability: both pipelines are built to extract semantic content from noisy signals, so they can also recover instructions from signal components that human listeners do not perceive, such as ultrasonic frequencies, subliminal overlays, and adversarial perturbations below the audible threshold. The gap: text-based safety classifiers operate on the transcript, but adversarial audio can yield a transcript that differs from what a human hears, or carry instructions that are absent from the human-audible signal altogether.

Detection
  • Audio anomaly detection: Detect subliminal audio layers, ultrasonic content, and stereo channel divergence
  • Transcript verification: Compare ASR transcript against a secondary ASR system or human review
  • Metadata stripping for audio: Remove non-audio metadata before processing
  • Silence region monitoring: Flag unusual ASR output during silence segments
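As a rough illustration of the audio anomaly and silence-region checks listed above, the following sketch uses only the Python standard library. The function names, the `(text, start_s, end_s)` word-timestamp shape, and every threshold are assumptions made for this example and are not taken from any particular ASR system. Ultrasonic content can only be present if the capture sample rate exceeds 40 kHz, and the small FFT helper expects a frame whose length is a power of two.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ultrasonic_energy_ratio(frame, sample_rate, cutoff_hz=20_000.0):
    """Fraction of spectral energy at or above cutoff_hz in one frame."""
    spectrum = fft(frame)
    n = len(frame)
    total = high = 0.0
    for k in range(1, n // 2):  # positive-frequency bins, DC skipped
        energy = abs(spectrum[k]) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


def stereo_divergence(left, right):
    """RMS of (left - right) relative to the RMS of the mono mix.

    0.0 means identical channels; large values mean one channel carries
    content the other does not.
    """
    count = len(left)
    diff = math.sqrt(sum((l - r) ** 2 for l, r in zip(left, right)) / count)
    mid = math.sqrt(sum(((l + r) / 2) ** 2 for l, r in zip(left, right)) / count)
    if mid == 0:
        return float("inf") if diff > 0 else 0.0
    return diff / mid


def words_in_silence(samples, sample_rate, words, rms_floor=0.01):
    """Return ASR words whose time span covers only low-energy audio.

    `words` is a list of (text, start_s, end_s) tuples (hypothetical shape).
    """
    flagged = []
    for text, start_s, end_s in words:
        segment = samples[int(start_s * sample_rate):int(end_s * sample_rate)]
        if not segment:
            continue
        rms = math.sqrt(sum(s * s for s in segment) / len(segment))
        if rms < rms_floor:  # the ASR emitted a word where the audio is near-silent
            flagged.append(text)
    return flagged
```

A caller might run `ultrasonic_energy_ratio` on successive frames of a high-sample-rate capture and escalate when the ratio or the stereo divergence exceeds a tuned threshold. Any word returned by `words_in_silence` is a candidate for transcript review.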
Mitigation
  • Audio-channel safety classification (HIGH)
  • Low-pass filtering at 20 kHz (MEDIUM)
  • Multi-ASR consensus (HIGH)
  • Audio normalization before processing (MEDIUM)
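Two of the mitigations above, low-pass filtering and multi-ASR consensus, can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not a production design: the filter is a generic Hamming-windowed sinc FIR, the consensus rule is a pairwise word error rate check, and the names `lowpass_fir`, `apply_fir`, `word_error_rate`, `transcripts_agree`, and the `max_wer` threshold are all hypothetical. A 20 kHz cutoff only removes anything when the input is sampled above 40 kHz.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc low-pass coefficients, normalised to unity DC gain."""
    fc = cutoff_hz / sample_rate  # cutoff in cycles per sample
    centre = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        x = i - centre
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]


def apply_fir(samples, taps):
    """Centred direct-form convolution; output has the same length as the input."""
    half = len(taps) // 2
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = n + half - k
            if 0 <= j < len(samples):  # treat samples outside the clip as zero
                acc += tap * samples[j]
        out.append(acc)
    return out


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / max(len(ref), 1)


def transcripts_agree(transcripts, max_wer=0.2):
    """True when every pair of transcripts is within max_wer in both directions."""
    return all(
        word_error_rate(a, b) <= max_wer and word_error_rate(b, a) <= max_wer
        for i, a in enumerate(transcripts)
        for b in transcripts[i + 1:]
    )
```

In use, the audio would be filtered before it reaches any ASR system, and a request would be held for review when `transcripts_agree` returns `False` for the outputs of independent ASR systems. Disagreement does not prove an attack, since ordinary noise also causes divergence, so the threshold needs tuning against benign traffic.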
Chaining

Audio injection chains into T9-AT-015 (Temporal Synchronization) when audio and visual injection are coordinated. Chains into T11 (Agentic Exploitation) when voice-controlled agents process adversarial audio commands.

Framework mapping
OWASP LLM: LLM01
MITRE ATLAS: AML.T0051.001