T9-AT-004 · HIGH
Cross-Modal Confusion
T9 · Multimodal & Cross-Channel Attacks
Risk score: 220
Rating: High
Procedures: 4
Severity
Mechanism
Multimodal models rely on modality-routing logic to decide how each input is processed. Cross-modal confusion exploits this logic by supplying plain text that claims to describe content from another modality: "This image says: [injection text]" when no image is attached, or "The audio file contains: [injection text]" when no audio exists. The model may then treat the claimed content with the trust level associated with that modality instead of handling it as ordinary user text.
Detection
- Modality verification: Verify that claimed modality content actually exists before processing it with that modality's trust level
- Modality claim detection: Flag text that claims to describe content from another modality when that modality isn't present
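The two detection checks above can be sketched together as a single pre-processing filter. This is a minimal illustration, not a production detector: the `Request` type, the `MODALITY_CLAIMS` patterns, and the `unbacked_modality_claims` function are all hypothetical names introduced here, and the regular expressions cover only the phrasings quoted in the Mechanism section.

```python
import re
from dataclasses import dataclass

# Hypothetical, deliberately narrow patterns (assumption): they match
# phrasings such as "This image says" or "The audio file contains".
MODALITY_CLAIMS = {
    "image": re.compile(
        r"\b(?:this|the)\s+(?:image|picture|photo|screenshot)\s+"
        r"(?:says|shows|contains|reads)\b",
        re.IGNORECASE,
    ),
    "audio": re.compile(
        r"\b(?:this|the)\s+(?:audio|recording)(?:\s+file)?\s+"
        r"(?:says|contains|states)\b",
        re.IGNORECASE,
    ),
}


@dataclass(frozen=True)
class Request:
    text: str
    # Modalities that were actually attached, as reported by the
    # transport layer rather than by the text itself.
    attachments: frozenset = frozenset()


def unbacked_modality_claims(req: Request) -> list:
    """Return modalities the text claims to describe but that are absent."""
    return sorted(
        modality
        for modality, pattern in MODALITY_CLAIMS.items()
        if pattern.search(req.text) and modality not in req.attachments
    )
```

A request whose text says "This image says: ..." with no image attached is flagged for `image`, while the same text accompanied by a real image attachment is not. Pattern matching of this kind is easy to evade with rephrasing, so it is best treated as one signal among several.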
Mitigation
Modality presence verification · HIGH
Uniform trust level across channels · HIGH
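One way to read these two mitigations together is sketched below, under stated assumptions: the `Part` type, its fields, and the trust labels are hypothetical and not taken from any specific framework. Trust is derived only from who supplied the input, never from its modality or from what the content claims about itself, and a part declared as a given modality is rejected if no payload was actually received.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    source: str     # "system" or "user": who supplied this part (assumed field)
    modality: str   # "text", "image", "audio": declared by the transport layer
    payload: bytes  # raw bytes actually received for this part


def verify_presence(part: Part) -> bool:
    """Modality presence verification: the declared part must carry data."""
    return len(part.payload) > 0


def trust_level(part: Part) -> str:
    """Uniform trust across channels: modality never raises trust."""
    if not verify_presence(part):
        return "rejected"
    return "trusted" if part.source == "system" else "untrusted"
```

With this rule, user-supplied text and a user-supplied image receive the same `untrusted` label, so a text claim about a non-existent image has no higher-trust channel to borrow from.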
Chaining
Cross-modal confusion is a lightweight technique that chains into any other T9 technique by establishing a modality-trust bridge. It also chains into T4-AT-010 (State Confusion) when the modality claim creates processing ambiguity.
Framework mapping
OWASP LLM: LLM01
MITRE ATLAS: AML.T0051.000