T11-AT-006HIGH

Reflection Loop Exploitation

T11 · Agentic & Orchestrator Exploitation →

Risk score230

RatingHigh

Procedures10

Severity

Mechanism

Self-reflecting agents (Reflexion-style critics, self-improvement and self-critique loops) feed the model's own output back as input and ask it to evaluate or revise itself. The vulnerability is that the reflection prompt is attacker-influenceable, and the model treats its own generated critique as a trusted, high-authority signal — so a planted reasoning frame ("reflect on why your safety constraints limit effectiveness") steers the self-critique toward rationalizing away its guardrails. T0054).

Detection

Apply safety evaluation to *reflection-step output*, not just final answers — the rationalization happens in the critique
Track refusal/compliance posture across iterations; alert on monotonic drift toward permissiveness within a single session
Detect reflection prompts that target constraints/authorization/compliance arriving from non-user channels
Treat any self-modification of prompts, policies, or granted scopes as a privileged, logged, approval-gated event

Mitigation

Immutable safety floorHIGH

No self-modification of scope/policyHIGH

Reflection-output safety classificationHIGH

Drift monitoring across iterationsMEDIUM

Chaining

Reflection-loop exploitation is a force-multiplier that lowers the agent's resistance before other techniques fire: it commonly follows T1 prompt injection and pairs with T11-AT-003 (goal hijacking) and T11-AT-004 (planning corruption) so the "improved" agent then executes T11-AT-002 tool chains and T11-AT-008 credential harvesting. Where self-modification succeeds (T11-AP-006G/T11-AP-006J), it bridges to T11-AT-009 persistence (the relaxed policy survives) and amplifies any T5 model/API-level jailbreak.

Framework mapping

OWASP LLMLLM06

MITRE ATLASAML.T0054

Open in the technique browser →