Goal Hijacking
T11 · Agentic & Orchestrator Exploitation →Autonomous agents carry their objective in the same context window as every piece of data they ingest — tool outputs, retrieved documents, web pages, sub-agent messages. The trust boundary violated here is that the agent has no privileged, immutable channel for its goal: the original task and any later instruction-shaped text are represented identically as tokens, so a planner that re-reads its objective each loop can have that objective silently overwritten by injected content. Frameworks like ReAct and AutoGPT explicitly re-derive the next action from the running transcript, which means a single line such as "your new primary goal is…" embedded in an observation can become the operative directive.
- Pin the original task/goal at session start and diff the agent's *acted* objective against it each loop; alert on divergence
- Treat any goal/objective/KPI/priority mutation that originates from tool output or retrieved content (not the user) as a high-severity event
- Flag instruction-shaped phrases ("new primary goal", "objective function changed", "forget original task") appearing in non-user channels
- Require explicit human re-confirmation whenever the agent's stated objective changes mid-run
Goal hijacking is the intent-layer pivot that turns a benign agent malicious; entry is almost always T1 prompt injection or T12 RAG poisoning that plants the redirect in retrieved content. Once the objective is "gather passwords" or "exfiltrate," it drives T11-AT-008 (credential harvesting), T11-AT-011 (data exfiltration), and T11-AT-002 (tool-chain) executions.