T9-AT-001 · HIGH
Image-Based Prompt Injection
T9 · Multimodal & Cross-Channel Attacks
Risk score: 240
Rating: High
Procedures: 10
Mechanism
Vision-language models process images through a vision encoder (typically ViT) that converts pixels into token embeddings, which are then concatenated with text embeddings and fed to the language model. The fundamental vulnerability: the vision encoder does not distinguish between image content (a photograph of a cat) and image-embedded instructions (text rendered as pixels saying "ignore all safety rules"). Both are converted to the same embedding space and processed by the same language decoder.
Detection
- Vision-extracted text safety evaluation: Apply the full text safety classifier to any text recovered from images (for example via OCR), not only to text typed by the user
- Steganographic analysis: Statistical analysis of pixel distributions to detect LSB modifications (chi-square, RS analysis)
- Metadata content scanning: Scan EXIF and other metadata fields for instruction-like content before processing
- Color-contrast analysis: Detect text-like patterns with very low contrast (color-matched text)
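Two of the checks above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not a production detector: `looks_like_instruction` is a hypothetical regex stand-in for a real text safety classifier, `scan_metadata` assumes metadata has already been parsed into a plain dict of strings, and `lsb_pair_chi_square` assumes 8-bit samples passed as a flat list. The chi-square function returns only the statistic and degrees of freedom; turning that into a p-value needs a chi-square CDF from a statistics library.

```python
from __future__ import annotations

import re
from collections import Counter

# Hypothetical stand-in for a real text safety classifier: a few regex
# heuristics for instruction-like phrasing. Illustrative only.
INSTRUCTION_PATTERNS = [
    re.compile(r"\bignore\b.{0,40}\b(instructions|rules|prompt)\b", re.I),
    re.compile(r"\bdisregard\b.{0,40}\b(instructions|rules|prompt)\b", re.I),
    re.compile(r"\byou are now\b", re.I),
    re.compile(r"\bsystem prompt\b", re.I),
]


def looks_like_instruction(text: str) -> bool:
    """Return True if the text matches any instruction-like pattern."""
    return any(p.search(text) for p in INSTRUCTION_PATTERNS)


def scan_metadata(fields: dict[str, str]) -> list[str]:
    """Return the names of metadata fields whose values look like instructions."""
    return sorted(k for k, v in fields.items() if looks_like_instruction(v))


def lsb_pair_chi_square(samples: list[int]) -> tuple[float, int]:
    """Chi-square statistic over value pairs (2k, 2k+1) of 8-bit samples.

    LSB embedding tends to equalize the counts within each pair, so a
    statistic that is small relative to the degrees of freedom is a
    warning sign. Returns (statistic, degrees_of_freedom).
    """
    counts = Counter(samples)
    statistic = 0.0
    used_pairs = 0
    for k in range(128):
        even = counts.get(2 * k, 0)
        odd = counts.get(2 * k + 1, 0)
        expected = (even + odd) / 2  # expected count if the pair is equalized
        if expected == 0:
            continue  # skip pairs that never occur in the image
        statistic += (even - expected) ** 2 / expected
        used_pairs += 1
    return statistic, max(used_pairs - 1, 0)
```

The same `looks_like_instruction` check would apply to OCR output from the image, so typed text, rendered text, and metadata all pass through one policy.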
Mitigation
- Vision-channel safety classification: HIGH
- Metadata stripping before processing: HIGH
- Steganographic detection layer: MEDIUM
- Instruction/data separation for vision tokens: HIGH
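As a rough illustration of metadata stripping and instruction/data separation, the sketch below keeps a provenance tag on every token so that a downstream policy can refuse to take instructions from the vision channel. Everything here is a simplifying assumption: the image is a hypothetical dict with a `pixels` key plus arbitrary metadata keys, tokens are plain strings rather than embeddings, and `Channel`, `TaggedToken`, `strip_metadata`, `tag`, and `instruction_eligible` are illustrative names, not part of any real model API. A real system would need to carry this separation into the model itself.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    """Where a token came from."""
    SYSTEM = "system"
    USER_TEXT = "user_text"
    VISION = "vision"


@dataclass(frozen=True)
class TaggedToken:
    """A token that carries its provenance alongside its content."""
    text: str
    channel: Channel


def strip_metadata(image: dict) -> dict:
    """Keep only pixel data; drop EXIF, comments, and any other fields.

    Assumes a hypothetical image representation: a dict with a "pixels" key.
    """
    return {"pixels": image["pixels"]}


def tag(tokens: list[str], channel: Channel) -> list[TaggedToken]:
    """Attach a channel tag to each token before sequences are concatenated."""
    return [TaggedToken(t, channel) for t in tokens]


def instruction_eligible(sequence: list[TaggedToken]) -> list[TaggedToken]:
    """Return only tokens allowed to act as instructions.

    Vision-channel tokens stay in the sequence as data, but they are
    excluded here so they cannot be treated as commands.
    """
    return [t for t in sequence if t.channel is not Channel.VISION]
```

For example, concatenating `tag(["describe", "this", "image"], Channel.USER_TEXT)` with `tag(["ignore", "all", "safety", "rules"], Channel.VISION)` gives a seven-token sequence, and `instruction_eligible` returns only the three user-text tokens.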
Chaining
Image-based prompt injection is the primary entry point for all multimodal attacks. Chains into T9-AT-006 (Visual Adversarial Examples) when perturbation-based techniques augment typographic injection.
Framework mapping
OWASP LLM: LLM01
MITRE ATLAS: AML.T0051.001