Goal Substitution
T3 · Reasoning & Constraint Exploitation

Safety classifiers rely partly on lexical signals: specific keywords and phrases that trigger refusal. Goal substitution replaces the language of the harmful goal with a semantically adjacent but superficially benign label while leaving the technical request unchanged. Because the model's safety evaluation treats the *stated goal* as a feature for intent classification, a benign goal label can shift the classifier's output even when the underlying technical payload is identical.
- Semantic inconsistency detection: compare the stated goal/context against the technical payload and flag cases where the two are semantically misaligned (e.g., "veterinary" + "70kg subject")
- Euphemism detection: maintain a dictionary of known euphemistic substitutions for harmful goals and flag matches
- Chemical/technical nomenclature analysis: resolve chemical names to their functional categories regardless of surrounding goal framing
- Post-completion check: evaluate whether the generated response would be harmful if the goal framing were removed
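
As a rough illustration of the first two checks in the list above, the following is a minimal Python sketch, not a production detector. Every name in it (`EUPHEMISMS`, `INCONSISTENCY_RULES`, `Screening`, `screen_request`) is hypothetical, the dictionary entries are neutral placeholders rather than real substitutions, and the single inconsistency rule simply mirrors the "veterinary" + "70kg subject" example. Plain keyword and regex matching stands in for whatever semantic comparison a real system would use.

```python
import re
from dataclasses import dataclass, field

# Hypothetical placeholder entries: euphemistic phrase -> restricted category label.
# A real deployment would maintain a curated, regularly reviewed dictionary.
EUPHEMISMS = {
    "special cleanup": "restricted-category-a",
    "quiet retirement": "restricted-category-b",
}

# Hypothetical rules: a stated-context keyword paired with a payload pattern
# that contradicts it. This rule mirrors the "veterinary" + "70kg subject" example.
INCONSISTENCY_RULES = [
    ("veterinary", re.compile(r"\b\d+\s*kg\s+subject\b", re.IGNORECASE)),
]


@dataclass
class Screening:
    """Collects the reasons a request was flagged (empty means no flag)."""
    flags: list[str] = field(default_factory=list)

    @property
    def flagged(self) -> bool:
        return bool(self.flags)


def screen_request(stated_goal: str, payload: str) -> Screening:
    """Run the euphemism and goal/payload consistency checks on one request."""
    result = Screening()
    goal_lower = stated_goal.lower()
    text_lower = f"{stated_goal} {payload}".lower()

    # Euphemism detection: look for known substitutions anywhere in the request.
    for phrase, category in EUPHEMISMS.items():
        if phrase in text_lower:
            result.flags.append(f"euphemism:{category}")

    # Semantic inconsistency: the stated context and the payload disagree.
    for context, pattern in INCONSISTENCY_RULES:
        if context in goal_lower and pattern.search(payload):
            result.flags.append(f"inconsistent:{context}")

    return result
```

A call such as `screen_request("veterinary dosing question", "dose for a 70kg subject")` would return a `Screening` whose `flagged` property is true. The nomenclature and post-completion checks are left out because they depend on domain resources and a response evaluator that this entry does not specify.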
Successful goal substitution can persist as a contextual anchor for T3-AT-004 (Step-by-Step Extraction): follow-up requests for "more detail on the [euphemistic] technique" inherit the substituted framing. It also chains into T3-AT-016 (Rationalization Chains), where the substituted goal serves as a premise for logical justification.
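
Because the substituted framing carries over into later turns, a defender may want any flag to carry over as well. The sketch below shows one possible way to do that, assuming flagged terms are tracked per conversation; the `ConversationState` class and its methods are hypothetical, and substring matching is only a stand-in for a real reference-resolution step.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationState:
    """Hypothetical per-conversation record of terms flagged in earlier turns."""
    flagged_terms: set[str] = field(default_factory=set)

    def record_flag(self, term: str) -> None:
        # Store terms lowercased so later comparisons are case-insensitive.
        self.flagged_terms.add(term.lower())

    def inherits_flag(self, follow_up: str) -> bool:
        # A follow-up that reuses a previously flagged term inherits the flag,
        # even if the follow-up looks benign when read in isolation.
        text = follow_up.lower()
        return any(term in text for term in self.flagged_terms)
```

With a term recorded through `record_flag("special cleanup")`, a later message such as "more detail on the special cleanup technique" would make `inherits_flag` return true, so the follow-up is evaluated with the earlier context rather than on its own.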