T9-AT-003HIGH

Video Manipulation Attacks

T9 · Multimodal & Cross-Channel Attacks →
Risk score245
RatingHigh
Procedures10
Severity
Mechanism

Video models process temporal sequences of frames, subtitles, audio tracks, and metadata simultaneously. This creates multiple parallel injection channels — any single frame, subtitle entry, audio segment, or metadata field can carry injection payloads. The gap: video safety evaluation typically focuses on the visual content of keyframes and the audio transcript, but subtitle tracks, metadata streams, and non-keyframes are processed with lower scrutiny.

Detection
  • All-frame safety scanning: Evaluate all frames, not just keyframes, for embedded text or adversarial content
  • Subtitle file safety classification: Apply full safety classifier to subtitle/caption content before rendering
  • Video metadata stripping: Remove non-essential metadata streams before model processing
  • Frame-level anomaly detection: Detect outlier frames (single frames with very different content from neighbors)
Mitigation
Subtitle/caption safety filteringHIGH
Dense frame sampling for safetyHIGH
Metadata stripping before processingHIGH
Temporal consistency verificationMEDIUM
Chaining

Video attacks chain into T9-AT-015 (Temporal Synchronization) when audio and visual injection are desynchronized to create processing confusion. Chains into T11 (Agentic Exploitation) when video-processing agents act on injected instructions.

Framework mapping
OWASP LLMLLM01
MITRE ATLASAML.T0051.001
Open in the technique browser →