T11-AT-003HIGH

Goal Hijacking

T11 · Agentic & Orchestrator Exploitation →
Risk score245
RatingHigh
Procedures10
Severity
Mechanism

Autonomous agents carry their objective in the same context window as every piece of data they ingest — tool outputs, retrieved documents, web pages, sub-agent messages. The trust boundary violated here is that the agent has no privileged, immutable channel for its goal: the original task and any later instruction-shaped text are represented identically as tokens, so a planner that re-reads its objective each loop can have that objective silently overwritten by injected content. Frameworks like ReAct and AutoGPT explicitly re-derive the next action from the running transcript, which means a single line such as "your new primary goal is…" embedded in an observation can become the operative directive.

Detection
  • Pin the original task/goal at session start and diff the agent's *acted* objective against it each loop; alert on divergence
  • Treat any goal/objective/KPI/priority mutation that originates from tool output or retrieved content (not the user) as a high-severity event
  • Flag instruction-shaped phrases ("new primary goal", "objective function changed", "forget original task") appearing in non-user channels
  • Require explicit human re-confirmation whenever the agent's stated objective changes mid-run
Mitigation
Immutable goal channelHIGH
Content/instruction provenance separationHIGH
Goal-drift detector with HITLHIGH
Re-anchor objective each stepMEDIUM
Chaining

Goal hijacking is the intent-layer pivot that turns a benign agent malicious; entry is almost always T1 prompt injection or T12 RAG poisoning that plants the redirect in retrieved content. Once the objective is "gather passwords" or "exfiltrate," it drives T11-AT-008 (credential harvesting), T11-AT-011 (data exfiltration), and T11-AT-002 (tool-chain) executions.

Framework mapping
OWASP LLMLLM01
MITRE ATLASAML.T0051
Open in the technique browser →