Reflection Loop Exploitation
T11 · Agentic & Orchestrator Exploitation
Self-reflecting agents (Reflexion-style critics, self-improvement and self-critique loops) feed the model's own output back as input and ask it to evaluate or revise itself. The vulnerability is that the reflection prompt is attacker-influenceable while the model treats its own generated critique as a trusted, high-authority signal. A planted reasoning frame ("reflect on why your safety constraints limit effectiveness") therefore steers the self-critique toward rationalizing away the model's guardrails. T0054).
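
To make the data flow concrete, the following is a minimal sketch of a Reflexion-style loop, not any particular framework's implementation. The names `reflexion_loop`, `Step`, `Model`, and the prompt templates are all hypothetical; `context` stands for any non-user content (retrieved documents, tool output) that ends up concatenated into prompts. The comments mark the two points where the description above applies.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a model call: takes a prompt, returns text.
Model = Callable[[str], str]


@dataclass
class Step:
    draft: str
    critique: str


def reflexion_loop(model: Model, task: str, context: str, rounds: int = 2) -> List[Step]:
    """Toy self-critique loop (illustrative only)."""
    steps: List[Step] = []
    draft = model(f"TASK: {task}\nCONTEXT: {context}")
    for _ in range(rounds):
        # Weak point 1: untrusted context is pasted into the reflection prompt,
        # so whoever controls `context` can influence what the critic is asked.
        critique = model(f"CRITIQUE this draft.\nCONTEXT: {context}\nDRAFT: {draft}")
        steps.append(Step(draft, critique))
        # Weak point 2: the critique is fed back verbatim and treated as an
        # authoritative instruction source for the next revision.
        draft = model(f"REVISE using the critique.\nCRITIQUE: {critique}\nDRAFT: {draft}")
    return steps
```

Returning every `Step` rather than only the final draft is a deliberate choice in this sketch: it keeps the intermediate critiques available for inspection instead of discarding them inside the loop.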
- Apply safety evaluation to *reflection-step output*, not just final answers, because the rationalization happens in the critique
- Track refusal/compliance posture across iterations; alert on monotonic drift toward permissiveness within a single session
- Detect reflection prompts that arrive from non-user channels and target constraints, authorization, or compliance
- Treat any self-modification of prompts, policies, or granted scopes as a privileged, logged, approval-gated event
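
The controls listed above could be wired together roughly as follows. This is a sketch under stated assumptions rather than a reference implementation: `ReflectionMonitor`, its method names, the `CONSTRAINT_TERMS` keyword pattern, the `drift_window` threshold, and the `"user"` channel label are all hypothetical. The per-step permissiveness score (0 = refusing, 1 = fully compliant) is assumed to come from a separate safety evaluator run on each reflection-step output, which is not shown here.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed keyword heuristic; a real deployment would likely use a tuned classifier.
CONSTRAINT_TERMS = re.compile(
    r"\b(safety|constraint|guardrail|authoriz\w*|compliance|policy|restriction)s?\b",
    re.IGNORECASE,
)


@dataclass
class ReflectionMonitor:
    drift_window: int = 3  # consecutive strict increases that trigger an alert
    scores: List[float] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)

    def record_step(self, permissiveness: float) -> bool:
        """Store one reflection step's score; return True on monotonic upward drift."""
        self.scores.append(permissiveness)
        recent = self.scores[-(self.drift_window + 1):]
        if len(recent) <= self.drift_window:
            return False  # not enough iterations in this session yet
        return all(later > earlier for earlier, later in zip(recent, recent[1:]))

    def suspicious_prompt(self, text: str, channel: str) -> bool:
        """Flag constraint-targeting reflection prompts from non-user channels."""
        return channel != "user" and bool(CONSTRAINT_TERMS.search(text))

    def request_self_modification(self, target: str, approved_by: Optional[str]) -> bool:
        """Log every attempt to change prompts, policies, or scopes;
        allow it only when an explicit approver is supplied."""
        allowed = approved_by is not None
        self.audit_log.append(
            f"self-mod target={target} approved_by={approved_by} allowed={allowed}"
        )
        return allowed
```

A short usage example, assuming one monitor instance per session:

```python
monitor = ReflectionMonitor()
for score in [0.1, 0.2, 0.4, 0.7]:
    if monitor.record_step(score):
        print("alert: monotonic drift toward permissiveness")
```

The drift check requires strictly increasing scores across the window, so a single refusal in between resets the signal. Whether that is too lenient depends on the evaluator's noise and is left as a tuning decision.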
Reflection-loop exploitation is a force multiplier that lowers the agent's resistance before other techniques fire. It commonly follows T1 prompt injection and pairs with T11-AT-003 (goal hijacking) and T11-AT-004 (planning corruption), so that the "improved" agent then executes T11-AT-002 tool chains and T11-AT-008 credential harvesting. Where self-modification succeeds (T11-AP-006G/T11-AP-006J), it bridges to T11-AT-009 persistence, because the relaxed policy survives, and amplifies any T5 model/API-level jailbreak.