T9-AT-004HIGH

Cross-Modal Confusion

T9 · Multimodal & Cross-Channel Attacks →
Risk score220
RatingHigh
Procedures4
Severity
Mechanism

Multimodal models have modality-routing logic that determines how to process each input. Cross-modal confusion exploits this by presenting text that claims to describe a different modality's content: "This image says: [injection text]" when no image is present, or "The audio file contains: [injection text]" when no audio exists. The model may process the text claim about the non-existent modality with the trust level appropriate to that modality rather than as user text.

Detection
  • Modality verification: Verify that claimed modality content actually exists before processing it with that modality's trust level
  • Modality claim detection: Flag text that claims to describe content from another modality when that modality isn't present
Mitigation
Modality presence verificationHIGH
Uniform trust level across channelsHIGH
Chaining

Cross-modal confusion is a lightweight technique that chains into any other T9 technique by establishing a modality-trust bridge. Chains into T4-AT-010 (State Confusion) when the modality claim creates processing ambiguity.

Framework mapping
OWASP LLMLLM01
MITRE ATLASAML.T0051.000
Open in the technique browser →