T9-AT-003HIGH
Video Manipulation Attacks
T9 · Multimodal & Cross-Channel Attacks →Risk score245
RatingHigh
Procedures10
Severity
Mechanism
Video models process temporal sequences of frames, subtitles, audio tracks, and metadata simultaneously. This creates multiple parallel injection channels — any single frame, subtitle entry, audio segment, or metadata field can carry injection payloads. The gap: video safety evaluation typically focuses on the visual content of keyframes and the audio transcript, but subtitle tracks, metadata streams, and non-keyframes are processed with lower scrutiny.
Detection
- All-frame safety scanning: Evaluate all frames, not just keyframes, for embedded text or adversarial content
- Subtitle file safety classification: Apply full safety classifier to subtitle/caption content before rendering
- Video metadata stripping: Remove non-essential metadata streams before model processing
- Frame-level anomaly detection: Detect outlier frames (single frames with very different content from neighbors)
Mitigation
Subtitle/caption safety filteringHIGH
Dense frame sampling for safetyHIGH
Metadata stripping before processingHIGH
Temporal consistency verificationMEDIUM
Chaining
Video attacks chain into T9-AT-015 (Temporal Synchronization) when audio and visual injection are desynchronized to create processing confusion. Chains into T11 (Agentic Exploitation) when video-processing agents act on injected instructions.
Framework mapping
Open in the technique browser →OWASP LLMLLM01
MITRE ATLASAML.T0051.001