T9-AT-001HIGH

Image-Based Prompt Injection

T9 · Multimodal & Cross-Channel Attacks →

Risk score240

RatingHigh

Procedures10

Severity

Mechanism

Vision-language models process images through a vision encoder (typically ViT) that converts pixels into token embeddings, which are then concatenated with text embeddings and fed to the language model. The fundamental vulnerability: the vision encoder does not distinguish between image content (a photograph of a cat) and image-embedded instructions (text rendered as pixels saying "ignore all safety rules"). Both are converted to the same embedding space and processed by the same language decoder.

Detection

Vision-extracted text safety evaluation: Apply the full text safety classifier to any text extracted from images by the vision encoder
Steganographic analysis: Statistical analysis of pixel distributions to detect LSB modifications (chi-square, RS analysis)
Metadata content scanning: Scan EXIF and other metadata fields for instruction-like content before processing
Color-contrast analysis: Detect text-like patterns with very low contrast (color-matched text)

Mitigation

Vision-channel safety classificationHIGH

Metadata stripping before processingHIGH

Steganographic detection layerMEDIUM

Instruction/data separation for vision tokensHIGH

Chaining

Image-based prompt injection is the primary entry point for all multimodal attacks. Chains into T9-AT-006 (Visual Adversarial Examples) when perturbation-based techniques augment typographic injection.

Framework mapping

OWASP LLMLLM01

MITRE ATLASAML.T0051.001

Open in the technique browser →