T9-AT-002HIGH

Audio Instruction Embedding

T9 · Multimodal & Cross-Channel Attacks →

Risk score235

RatingHigh

Procedures10

Severity

Mechanism

Audio LLMs process speech through speech-to-text (ASR) or direct audio encoding pipelines that convert audio signals into token embeddings. The vulnerability: audio processing is designed to extract semantic content from noisy signals, which means it can recover instructions from audio signals that are imperceptible to human listeners — ultrasonic frequencies, subliminal overlays, adversarial perturbations below the audible threshold. The gap: text-based safety classifiers operate on the transcript, but adversarial audio can produce transcripts that differ from what humans perceive, or embed instructions that are not present in the human-audible signal at all.

Detection

Audio anomaly detection: Detect subliminal audio layers, ultrasonic content, and stereo channel divergence
Transcript verification: Compare ASR transcript against a secondary ASR system or human review
Metadata stripping for audio: Remove non-audio metadata before processing
Silence region monitoring: Flag unusual ASR output during silence segments

Mitigation

Audio-channel safety classificationHIGH

Low-pass filtering at 20kHzMEDIUM

Multi-ASR consensusHIGH

Audio normalization before processingMEDIUM

Chaining

Audio injection chains into T9-AT-015 (Temporal Synchronization) when audio and visual injection are coordinated. Chains into T11 (Agentic Exploitation) when voice-controlled agents process adversarial audio commands.

Framework mapping

OWASP LLMLLM01

MITRE ATLASAML.T0051.001

Open in the technique browser →