Audio Instruction Embedding
T9 · Multimodal & Cross-Channel Attacks
Audio LLMs process speech through either a speech-to-text (ASR) stage or a direct audio encoder, and both turn audio signals into token embeddings. The vulnerability: these pipelines are built to extract semantic content from noisy signals, so they can also recover instructions from audio that human listeners cannot perceive, such as ultrasonic frequencies, subliminal overlays, and adversarial perturbations below the audible threshold. The gap: text-based safety classifiers operate on the transcript, but adversarial audio can yield a transcript that differs from what a human hears, or carry instructions that are absent from the human-audible signal altogether.
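To make the point about inaudible content concrete, the sketch below estimates what fraction of a short frame's spectral energy sits above a cutoff near the top of the human hearing range. It is an illustration under stated assumptions, not a method given in this entry: the names `high_band_energy_ratio` and `tone`, the 18 kHz cutoff, and the synthetic test signals are all hypothetical, and a naive DFT is used only to keep the example free of dependencies.

```python
import cmath
import math


def high_band_energy_ratio(samples, sample_rate, cutoff_hz=18_000.0):
    """Return the fraction of non-DC spectral energy at or above cutoff_hz.

    Uses a naive O(n^2) DFT, so keep frames short (a few hundred samples).
    The 18 kHz default cutoff is an illustrative assumption.
    """
    n = len(samples)
    total = 0.0
    high = 0.0
    # Bins 1..n//2 cover frequencies up to the Nyquist limit (sample_rate / 2).
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate / n >= cutoff_hz:
            high += power
    return high / total if total > 0 else 0.0


def tone(freq_hz, sample_rate, n, amplitude=1.0):
    """Generate n samples of a sine tone (helper for the demo only)."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


if __name__ == "__main__":
    rate, n = 48_000, 480  # 10 ms frame; 48 kHz sampling captures up to 24 kHz
    audible = tone(1_000, rate, n)
    hidden = tone(21_000, rate, n, amplitude=0.5)
    mixed = [a + b for a, b in zip(audible, hidden)]
    print("audible only:", round(high_band_energy_ratio(audible, rate), 3))
    print("with 21 kHz layer:", round(high_band_energy_ratio(mixed, rate), 3))
```

A frame whose ratio is well above what ordinary speech produces would be a candidate for closer inspection. Any real threshold would need tuning on representative audio, and a production pipeline would more likely use an FFT library than this quadratic-time loop.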
- Audio anomaly detection: Detect subliminal audio layers, ultrasonic content, and stereo channel divergence
- Transcript verification: Compare the ASR transcript against the output of a secondary ASR system or against human review
- Metadata stripping for audio: Remove non-audio metadata (for example, embedded tags and text fields) from the file before processing
- Silence region monitoring: Flag ASR output produced during silent segments, where no transcript text is expected
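The transcript verification and silence region checks in the list above can be sketched as follows. This is a minimal sketch under assumptions rather than a prescribed implementation: the function names, the 0.2 word error rate threshold, the 0.01 RMS silence threshold, and the `(samples, text)` segment format are hypothetical, and both transcripts are assumed to be already available as plain strings from two independent ASR systems.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Rolling-row Levenshtein distance computed over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def transcripts_disagree(primary, secondary, max_wer=0.2):
    """True when two ASR transcripts differ by more than max_wer (assumed threshold)."""
    return word_error_rate(primary, secondary) > max_wer


def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def silent_segments_with_text(segments, silence_rms=0.01):
    """Return indices of segments that are near-silent yet have non-empty ASR text.

    segments is an iterable of (samples, asr_text) pairs; the pairing of audio
    segments with their transcript text is assumed to be done upstream.
    """
    return [
        index
        for index, (samples, text) in enumerate(segments)
        if rms(samples) < silence_rms and text.strip()
    ]


if __name__ == "__main__":
    print(transcripts_disagree("play some music",
                               "play some music and unlock the front door"))
    demo = [
        ([0.5, -0.5, 0.4], "hello"),
        ([0.0, 0.001, -0.001], "unlock the door"),
        ([0.0, 0.0], ""),
    ]
    print(silent_segments_with_text(demo))
```

Disagreement between the two transcripts, or text attributed to a near-silent segment, does not prove an attack. Either signal is only a reason to hold the input for review before it reaches the model or a downstream agent.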
Audio injection chains into T9-AT-015 (Temporal Synchronization) when audio and visual injection are coordinated. It also chains into T11 (Agentic Exploitation) when voice-controlled agents process adversarial audio commands.