T3-AT-005 · HIGH

Goal Substitution

T3 · Reasoning & Constraint Exploitation

Risk score: 205

Rating: High

Procedures: 10

Mechanism

Safety classifiers rely partly on lexical signals: specific keywords and phrases that trigger refusal. Goal substitution replaces the language of the harmful goal with a semantically adjacent but superficially benign label while leaving the technical request unchanged. Because the model's safety evaluation uses the *stated goal* as a feature for intent classification, a benign goal label can shift the classifier's output even when the underlying technical payload is identical.

Detection

Semantic inconsistency detection: compare the stated goal/context against the technical payload — flag when they're semantically misaligned (e.g., "veterinary" + "70kg subject")
Euphemism detection: maintain a dictionary of known euphemistic substitutions for harmful goals and flag matches
Chemical/technical nomenclature analysis: resolve chemical names to their functional categories regardless of surrounding goal framing
Post-completion check: evaluate whether the generated response would be harmful if the goal framing were removed
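The first two checks above can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the rule table `MISMATCH_RULES`, the dictionary `EUPHEMISMS`, and the function `check_request` are hypothetical names, and the entries are placeholders modelled on the "veterinary" + "70kg subject" example. A real system would likely rely on embeddings or a trained classifier rather than regular expressions.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Finding:
    rule: str      # which check fired
    evidence: str  # the text that triggered it


# Hypothetical rule table: for each stated-goal domain, patterns that would be
# out of place in a request that is genuinely about that domain.
MISMATCH_RULES: Dict[str, List[str]] = {
    "veterinary": [r"\b\d+\s*kg\s+subject\b", r"\b(?:human|person|adult)\b"],
}

# Hypothetical euphemism dictionary (placeholder entry only):
# euphemistic phrase -> category it is assumed to stand in for.
EUPHEMISMS: Dict[str, str] = {
    "placeholder-euphemism": "placeholder-category",
}


def check_request(stated_goal: str, payload: str) -> List[Finding]:
    """Flag label/payload mismatches and known euphemisms."""
    findings: List[Finding] = []
    goal = stated_goal.lower()

    # Semantic inconsistency: the goal names a domain, the payload contradicts it.
    for domain, patterns in MISMATCH_RULES.items():
        if domain not in goal:
            continue
        for pattern in patterns:
            match = re.search(pattern, payload, flags=re.IGNORECASE)
            if match:
                findings.append(Finding(f"mismatch:{domain}", match.group(0)))

    # Euphemism detection: simple substring lookup over goal and payload.
    text = f"{stated_goal} {payload}".lower()
    for phrase, category in EUPHEMISMS.items():
        if phrase in text:
            findings.append(Finding(f"euphemism:{category}", phrase))

    return findings
```

Calling `check_request("veterinary sedation protocol", "dose for a 70kg subject")` returns a single `mismatch:veterinary` finding under these assumed rules, while a payload that stays consistent with the stated domain returns an empty list.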

Mitigation

Label-payload consistency checker · HIGH

Goal-independent content evaluation · HIGH

Known-euphemism blocklist · LOW

Chemical nomenclature resolution · MEDIUM
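One way to approximate goal-independent content evaluation is to score the request twice, once with the stated goal and once without it, and keep the stricter result. The sketch below assumes a risk scorer is available as a plain callable returning a value in [0, 1]; `Scorer`, `goal_independent_risk`, `shift_threshold`, and `toy_scorer` are hypothetical names, and the toy scorer only imitates a classifier that is swayed by a benign-looking label.

```python
from typing import Callable, Dict

# Hypothetical interface: any function mapping text to a risk score in [0, 1].
Scorer = Callable[[str], float]


def goal_independent_risk(
    goal: str,
    payload: str,
    score: Scorer,
    shift_threshold: float = 0.3,  # assumed cut-off, to be tuned per deployment
) -> Dict[str, object]:
    """Score the payload with and without its goal framing; keep the stricter score."""
    framed = score(f"{goal}\n{payload}")
    unframed = score(payload)
    shift = unframed - framed  # how much the label lowered the score
    return {
        "risk": max(framed, unframed),
        "label_shift": shift,
        "suspicious": shift >= shift_threshold,
    }


def toy_scorer(text: str) -> float:
    """Stand-in for a real classifier: a benign-looking label lowers the score."""
    return 0.2 if "benign-label" in text.lower() else 0.8


result = goal_independent_risk("benign-label study", "technical request", toy_scorer)
```

Taking the maximum of the two scores means the label can never lower the final risk, and a large `label_shift` is itself a signal that the framing, not the payload, is driving the classification.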

Chaining

Successful goal substitution can persist as a contextual anchor for T3-AT-004 (Step-by-Step Extraction): follow-up requests for "more detail on the [euphemistic] technique" inherit the substituted framing. It also chains into T3-AT-016 (Rationalization Chains), where the substituted goal provides a premise for logical justification.
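Because the substituted framing carries over between turns, a defender may want flags to carry over as well. The following is a minimal sketch under that assumption; `ConversationState`, `flag`, and `inherits_flag` are hypothetical names, and the substring match stands in for whatever reference resolution a real system would use.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ConversationState:
    """Remembers framings flagged earlier in a conversation."""

    flagged_terms: Set[str] = field(default_factory=set)

    def flag(self, term: str) -> None:
        # Record a framing that an earlier check marked as a likely substitution.
        self.flagged_terms.add(term.lower())

    def inherits_flag(self, message: str) -> bool:
        # A follow-up that reuses a flagged framing inherits the earlier flag.
        text = message.lower()
        return any(term in text for term in self.flagged_terms)
```

With this state object, a follow-up such as "more detail on the [flagged term] technique" is evaluated with the earlier flag still attached rather than as a fresh, context-free request.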

Framework mapping

OWASP LLM: LLM01

MITRE ATLAS: AML.T0054
