T3-AT-018 · MEDIUM
Debate Positioning
T3 · Reasoning & Constraint Exploitation
Risk score: 180
Rating: Medium
Procedures: 10
Severity
Mechanism
Models are trained for intellectual neutrality: presenting "both sides" of an argument and steel-manning positions are behaviors that RLHF strongly rewards. Debate positioning exploits this by framing harmful content as necessary for balanced discourse, so that the model's neutrality drive overrides its content restrictions. This is architecturally distinct from Academic Pretense (T3-AT-002), which claims *research* purposes; Debate Positioning claims *rhetorical* purposes.
Detection
- Debate/argument markers: "devil's advocate," "steel-man," "both sides," "debate prep," "counter-argument," "balanced discussion"
- These markers combined with requests for operational detail (working code, synthesis, construction) rather than policy or philosophical discussion
- Flag: debate framing + "illustrate with" or "provide examples of" restricted content
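
The detection signals above can be sketched as a simple lexical pre-filter. This is a minimal illustration, not a production detector: the function name `flag_debate_positioning`, the regex spellings, and the returned field names are all hypothetical, and the cue lists contain only the phrases quoted in the list above. Whether the requested content is actually restricted cannot be decided lexically, so a positive flag here is only a candidate for further review.

```python
import re

# Debate/argument markers, taken from the detection list above.
DEBATE_MARKERS = [
    r"devil['’]?s advocate",
    r"steel-?man",
    r"both sides",
    r"debate prep",
    r"counter-?argument",
    r"balanced discussion",
]

# Operational-detail cues. Only the three examples named in the list are
# included; a real deployment would need a much broader set (assumption).
OPERATIONAL_CUES = [r"working code", r"synthesis", r"construction"]

# Illustration-request cues, taken from the "Flag" bullet above.
ILLUSTRATION_CUES = [r"illustrate with", r"provide examples? of"]


def _any_match(patterns, text):
    """Return True if any pattern occurs in text (case-insensitive)."""
    return any(re.search(p, text, re.IGNORECASE) for p in patterns)


def flag_debate_positioning(prompt: str) -> dict:
    """Report which signals fire and whether the combination is flagged.

    A prompt is flagged only when debate framing co-occurs with a request
    for operational detail or an illustration request. Debate framing by
    itself is treated as benign.
    """
    debate = _any_match(DEBATE_MARKERS, prompt)
    operational = _any_match(OPERATIONAL_CUES, prompt)
    illustration = _any_match(ILLUSTRATION_CUES, prompt)
    return {
        "debate_framing": debate,
        "operational_request": operational,
        "illustration_request": illustration,
        "flag": debate and (operational or illustration),
    }
```

The conjunction is the point of the design: a policy-level request such as "steel-man the case for a carbon tax" sets `debate_framing` but leaves `flag` false, while the same framing paired with "illustrate with working code" sets it.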
Mitigation
- Debate-argument disaggregation: HIGH
- Illustration-request detection: HIGH
- Content-level evaluation: HIGH
- Policy-level engagement cap: MEDIUM
Chaining
Debate positioning establishes an intellectual context that enables T3-AT-016 (Rationalization Chains), where debate premises serve as syllogistic inputs. It chains from T3-AT-002 (Academic Pretense), where academic debate provides dual framing.
Framework mapping
OWASP LLM: LLM01
MITRE ATLAS: AML.T0054