T3-AT-018 · MEDIUM

Debate Positioning

T3 · Reasoning & Constraint Exploitation

Risk score: 180

Rating: Medium

Procedures: 10

Severity

Mechanism

Models are trained toward intellectual neutrality: presenting "both sides" of an argument and steel-manning positions is a core RLHF reward signal. Debate positioning exploits this by framing harmful content as necessary for balanced discourse, so that the model's neutrality drive overrides its content restrictions. This is architecturally distinct from Academic Pretense (T3-AT-002), which claims *research* purposes; Debate Positioning claims *rhetorical* purposes.

Detection

Debate/argument markers: "devil's advocate," "steel-man," "both sides," "debate prep," "counter-argument," "balanced discussion"
Any of these markers combined with a request for operational detail (working code, synthesis, construction) rather than policy or philosophical discussion
Flag: debate framing combined with "illustrate with" or "provide examples of" applied to restricted content
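
The detection signals above can be sketched as a simple keyword screen. This is a minimal illustration, not a production detector: the function name `screen_prompt`, the result keys, and the regular expressions are all hypothetical, and the operational-detail patterns are naive stand-ins for the examples listed above. The sketch does not judge whether the requested content is actually restricted; that still needs content-level evaluation.

```python
import re

# Debate/argument markers taken from the detection list above.
DEBATE_MARKERS = [
    r"devil[’']?s advocate",
    r"steel-?man",
    r"both sides",
    r"debate prep",
    r"counter-?argument",
    r"balanced discussion",
]

# Assumed keyword stand-ins for "operational detail"
# (working code, synthesis, construction).
OPERATIONAL_CUES = [
    r"working code",
    r"synthes(is|ize)",
    r"construct",
]

# Illustration-request phrases from the detection list above.
ILLUSTRATION_CUES = [
    r"illustrate with",
    r"provide examples? of",
]


def _matches_any(patterns, text):
    """Return True if any pattern occurs in text (case-insensitive)."""
    return any(re.search(p, text, re.IGNORECASE) for p in patterns)


def screen_prompt(prompt: str) -> dict:
    """Flag prompts that pair debate framing with an operational or illustration request."""
    debate = _matches_any(DEBATE_MARKERS, prompt)
    operational = _matches_any(OPERATIONAL_CUES, prompt)
    illustration = _matches_any(ILLUSTRATION_CUES, prompt)
    return {
        "debate_framing": debate,
        "operational_request": operational,
        "illustration_request": illustration,
        # Debate framing alone is benign; only the combination is flagged.
        "flag": debate and (operational or illustration),
    }


if __name__ == "__main__":
    print(screen_prompt("Steel-man both sides of the carbon tax debate."))
```

Requiring both signals keeps ordinary debate practice unflagged, which matches the distinction the list draws between operational requests and policy or philosophical discussion. A flagged prompt would then go on to a content-level check rather than being refused outright.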

Mitigation

Debate-argument disaggregation: HIGH

Illustration-request detection: HIGH

Content-level evaluation: HIGH

Policy-level engagement cap: MEDIUM

Chaining

Debate positioning establishes an intellectual context that enables T3-AT-016 (Rationalization Chains), where debate premises serve as syllogistic inputs. It chains from T3-AT-002 (Academic Pretense), where academic debate provides dual framing.

Framework mapping

OWASP LLM: LLM01

MITRE ATLAS: AML.T0054
