T3-AT-005 · HIGH

Goal Substitution

T3 · Reasoning & Constraint Exploitation
Risk score: 205
Rating: High
Procedures: 10
Mechanism

Safety classifiers rely partly on lexical signals: specific keywords and phrases that trigger refusal. Goal substitution replaces the harmful goal's language with a semantically adjacent but superficially benign label while leaving the technical request unchanged. Because the model's safety evaluation treats the *stated goal* as a feature for intent classification, a benign goal label shifts the classifier's output even when the underlying technical payload is identical.

Detection
  • Semantic inconsistency detection: compare the stated goal/context against the technical payload and flag cases where the two are semantically misaligned (e.g., "veterinary" + "70kg subject")
  • Euphemism detection: maintain a dictionary of known euphemistic substitutions for harmful goals and flag matches
  • Chemical/technical nomenclature analysis: resolve chemical names to their functional categories regardless of surrounding goal framing
  • Post-completion check: evaluate whether the generated response would be harmful if the goal framing were removed
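
The semantic inconsistency check in the first bullet can be sketched as a small rule-based comparison between the stated goal and the payload. This is a minimal illustration only: the `ConsistencyRule` type, the `RULES` table, the `check_consistency` function, and the 50 to 100 kg range are all assumptions made up for this example, built around the "veterinary" + "70kg subject" case above. A real detector would more likely use a learned semantic model than a regular expression.

```python
import re
from dataclasses import dataclass

# Matches masses such as "70kg" or "4.5 kg" in the payload text.
MASS_RE = re.compile(r"(\d+(?:\.\d+)?)\s*kg\b", re.IGNORECASE)


@dataclass(frozen=True)
class ConsistencyRule:
    """Hypothetical rule: a stated context plus a payload cue that contradicts it."""
    context_term: str                   # term expected in the stated goal
    mass_range_kg: tuple[float, float]  # payload masses in this range look misaligned
    reason: str


# Illustrative placeholder values, not vetted thresholds.
RULES = [
    ConsistencyRule(
        context_term="veterinary",
        mass_range_kg=(50.0, 100.0),
        reason="veterinary context with human-adult-range subject mass",
    ),
]


def check_consistency(stated_goal: str, payload: str) -> list[str]:
    """Return the reasons for every rule whose context and payload cue both match."""
    masses = [float(m) for m in MASS_RE.findall(payload)]
    goal = stated_goal.lower()
    flags = []
    for rule in RULES:
        if rule.context_term not in goal:
            continue
        low, high = rule.mass_range_kg
        if any(low <= m <= high for m in masses):
            flags.append(rule.reason)
    return flags
```

A flag here is a signal for further review, not a verdict: a 70 kg animal is plausible in veterinary practice, so a rule like this only makes sense alongside the other detection signals in the list.
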
Mitigation
  • Label-payload consistency checker: HIGH
  • Goal-independent content evaluation: HIGH
  • Known-euphemism blocklist: LOW
  • Chemical nomenclature resolution: MEDIUM
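
Goal-independent content evaluation can be sketched as scoring the same text twice, once as written and once with the goal framing stripped, and flagging cases where only the framed version passes. This is a hedged sketch rather than a reference implementation: `strip_framing`, `framing_dependent`, the `score` callable, and the 0.5 threshold are hypothetical names and values introduced here. The harm scorer itself is assumed to exist elsewhere and is passed in as a function.

```python
import re
from typing import Callable, Iterable


def strip_framing(text: str, framing_terms: Iterable[str]) -> str:
    """Remove stated-goal terms so that only the technical payload remains."""
    for term in framing_terms:
        text = re.sub(re.escape(term), " ", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


def framing_dependent(
    text: str,
    framing_terms: Iterable[str],
    score: Callable[[str], float],  # assumed harm scorer: higher means more harmful
    threshold: float = 0.5,         # illustrative cut-off, not a tuned value
) -> bool:
    """True when the text passes with its framing but fails once the framing is removed."""
    framed_score = score(text)
    bare_score = score(strip_framing(text, framing_terms))
    return framed_score < threshold <= bare_score
```

A result of `True` means the verdict depended on the goal label rather than on the content, which is the condition this mitigation targets.
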
Chaining

Successful goal substitution can persist as a contextual anchor for T3-AT-004 (Step-by-Step Extraction): follow-up requests for "more detail on the [euphemistic] technique" inherit the substituted framing. It also chains into T3-AT-016 (Rationalization Chains), where the substituted goal provides a premise for logical justification.
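
Because the substituted framing carries over to later turns, one possible countermeasure is to keep flagged framing terms as conversation-level state and recheck follow-ups against them. The `FramingTracker` class below is a hypothetical sketch of that idea, not part of any named tool, and it uses plain substring matching where a real system would need something more robust.

```python
from dataclasses import dataclass, field


@dataclass
class FramingTracker:
    """Hypothetical per-conversation store of framing terms flagged in earlier turns."""
    anchors: set[str] = field(default_factory=set)

    def record_flag(self, framing_term: str) -> None:
        """Remember a framing term that an earlier turn was flagged for."""
        self.anchors.add(framing_term.lower())

    def inherited_anchors(self, turn_text: str) -> set[str]:
        """Return previously flagged framing terms that this turn reuses."""
        lowered = turn_text.lower()
        return {anchor for anchor in self.anchors if anchor in lowered}
```

A non-empty result suggests the follow-up should be evaluated with the earlier flag in mind instead of as a fresh request.
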

Framework mapping
OWASP LLM: LLM01
MITRE ATLAS: AML.T0054