T9-AT-003HIGH

Video Manipulation Attacks

T9 · Multimodal & Cross-Channel Attacks →

Risk score245

RatingHigh

Procedures10

Severity

Mechanism

Video models process temporal sequences of frames, subtitles, audio tracks, and metadata simultaneously. This creates multiple parallel injection channels — any single frame, subtitle entry, audio segment, or metadata field can carry injection payloads. The gap: video safety evaluation typically focuses on the visual content of keyframes and the audio transcript, but subtitle tracks, metadata streams, and non-keyframes are processed with lower scrutiny.

Detection

All-frame safety scanning: Evaluate all frames, not just keyframes, for embedded text or adversarial content
Subtitle file safety classification: Apply full safety classifier to subtitle/caption content before rendering
Video metadata stripping: Remove non-essential metadata streams before model processing
Frame-level anomaly detection: Detect outlier frames (single frames with very different content from neighbors)

Mitigation

Subtitle/caption safety filteringHIGH

Dense frame sampling for safetyHIGH

Metadata stripping before processingHIGH

Temporal consistency verificationMEDIUM

Chaining

Video attacks chain into T9-AT-015 (Temporal Synchronization) when audio and visual injection are desynchronized to create processing confusion. Chains into T11 (Agentic Exploitation) when video-processing agents act on injected instructions.

Framework mapping

OWASP LLMLLM01

MITRE ATLASAML.T0051.001

Open in the technique browser →