T9-AT-001HIGH

Image-Based Prompt Injection

T9 · Multimodal & Cross-Channel Attacks →
Risk score240
RatingHigh
Procedures10
Severity
Mechanism

Vision-language models process images through a vision encoder (typically ViT) that converts pixels into token embeddings, which are then concatenated with text embeddings and fed to the language model. The fundamental vulnerability: the vision encoder does not distinguish between image content (a photograph of a cat) and image-embedded instructions (text rendered as pixels saying "ignore all safety rules"). Both are converted to the same embedding space and processed by the same language decoder.

Detection
  • Vision-extracted text safety evaluation: Apply the full text safety classifier to any text extracted from images by the vision encoder
  • Steganographic analysis: Statistical analysis of pixel distributions to detect LSB modifications (chi-square, RS analysis)
  • Metadata content scanning: Scan EXIF and other metadata fields for instruction-like content before processing
  • Color-contrast analysis: Detect text-like patterns with very low contrast (color-matched text)
Mitigation
Vision-channel safety classificationHIGH
Metadata stripping before processingHIGH
Steganographic detection layerMEDIUM
Instruction/data separation for vision tokensHIGH
Chaining

Image-based prompt injection is the primary entry point for all multimodal attacks. Chains into T9-AT-006 (Visual Adversarial Examples) when perturbation-based techniques augment typographic injection.

Framework mapping
OWASP LLMLLM01
MITRE ATLASAML.T0051.001
Open in the technique browser →