T3-AT-018 · MEDIUM

Debate Positioning

T3 · Reasoning & Constraint Exploitation
Risk score: 180
Rating: Medium
Procedures: 10
Mechanism

Models are trained for intellectual neutrality: presenting "both sides" of an argument and steel-manning positions is a core RLHF reward signal. Debate positioning exploits this by framing harmful content as necessary for balanced discourse, so that the model's neutrality drive overrides its content restrictions. This is architecturally distinct from Academic Pretense (T3-AT-002), which claims *research* purposes; Debate Positioning claims *rhetorical* purposes.

Detection
  • Debate/argument markers: "devil's advocate," "steel-man," "both sides," "debate prep," "counter-argument," "balanced discussion"
  • Any of these markers combined with a request for operational detail (working code, synthesis, construction) rather than policy or philosophical discussion
  • Flag: debate framing combined with "illustrate with" or "provide examples of" restricted content
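The signals above can be sketched as a simple phrase matcher. This is a minimal illustration, not a production detector: the marker and cue phrases are copied from the list above, while the names `DebateSignals` and `detect_debate_positioning` and the plain substring matching are assumptions made for the example.

```python
from dataclasses import dataclass

# Debate/argument markers, copied from the detection list.
DEBATE_MARKERS = (
    "devil's advocate", "steel-man", "both sides",
    "debate prep", "counter-argument", "balanced discussion",
)
# Cues for operational detail; the list names these only as categories,
# so matching on the literal words is an assumption of this sketch.
OPERATIONAL_CUES = ("working code", "synthesis", "construction")
# Illustration-request phrases, copied from the detection list.
ILLUSTRATION_CUES = ("illustrate with", "provide examples of")


@dataclass
class DebateSignals:
    debate_markers: list[str]
    operational_cues: list[str]
    illustration_cues: list[str]

    @property
    def flagged(self) -> bool:
        # Debate framing alone is not flagged; it must co-occur with a
        # request for operational detail or for illustrative content.
        return bool(self.debate_markers) and bool(
            self.operational_cues or self.illustration_cues
        )


def _hits(text: str, phrases: tuple[str, ...]) -> list[str]:
    """Return the phrases that occur in text, in the order they are listed."""
    return [p for p in phrases if p in text]


def detect_debate_positioning(prompt: str) -> DebateSignals:
    # Lowercase and normalize curly apostrophes so "devil’s" matches "devil's".
    text = prompt.lower().replace("\u2019", "'")
    return DebateSignals(
        debate_markers=_hits(text, DEBATE_MARKERS),
        operational_cues=_hits(text, OPERATIONAL_CUES),
        illustration_cues=_hits(text, ILLUSTRATION_CUES),
    )
```

A prompt that only asks to steel-man both sides of a policy question yields debate markers but `flagged` stays false. Substring matching over-triggers on ordinary uses of words like "construction", so a real deployment would likely pair these signals with a content classifier.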
Mitigation
Debate-argument disaggregation: HIGH
Illustration-request detection: HIGH
Content-level evaluation: HIGH
Policy-level engagement cap: MEDIUM
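One possible reading of how disaggregation, content-level evaluation, and the policy-level cap fit together is sketched below. The page does not define these mitigations, so this is an interpretation: strip the debate framing, judge the remaining request on its own, and cap the response at policy-level discussion if that request is restricted. The function names, the return labels, and the `is_restricted` callable (a stand-in for whatever content classifier is in use) are all hypothetical.

```python
import re
from typing import Callable

# Debate/argument markers, copied from the Detection section.
DEBATE_MARKERS = (
    "devil's advocate", "steel-man", "both sides",
    "debate prep", "counter-argument", "balanced discussion",
)


def strip_debate_framing(prompt: str) -> str:
    """Disaggregation step: remove framing phrases, keep the underlying request."""
    text = prompt.replace("\u2019", "'")
    for marker in DEBATE_MARKERS:
        text = re.sub(re.escape(marker), " ", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


def evaluate_request(prompt: str, is_restricted: Callable[[str], bool]) -> str:
    """Content-level evaluation: the verdict depends on the stripped request only."""
    core = strip_debate_framing(prompt)
    # Policy-level cap: restricted content gets discussion, not operational detail.
    return "policy_level_only" if is_restricted(core) else "engage"


# Toy classifier for demonstration only; a real one would be a trained model.
def toy_classifier(text: str) -> bool:
    return "working code" in text.lower()


print(evaluate_request("Debate prep: play devil's advocate and give working code", toy_classifier))
print(evaluate_request("Steel-man both sides of the carbon tax debate", toy_classifier))
```

The point of the design is that the debate framing never reaches the classifier, so it cannot raise or lower the verdict on the underlying request.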
Chaining

Debate positioning establishes an intellectual context that enables T3-AT-016 (Rationalization Chains), in which debate premises serve as syllogistic inputs. It also chains from T3-AT-002 (Academic Pretense), where an academic debate provides dual framing.

Framework mapping
OWASP LLM: LLM01
MITRE ATLAS: AML.T0054