Capability Overview
Google DeepMind researchers have published a formal taxonomy categorising adversarial attacks against autonomous AI agents into six discrete classes: content injection, semantic manipulation, cognitive state poisoning, behavioural control, systemic traps, and human-in-the-loop bypass. This is not a theoretical exercise — it arrives as enterprise AI agent adoption accelerates and organisations deploy agents with broad access to web browsing, email, internal document stores, CRM platforms, and tool APIs.
The taxonomy matters to defenders because it moves the conversation from anecdote to structure. Until now, the security community has discussed prompt injection and agent manipulation in fragmented terms. This framework provides a shared vocabulary and, critically, a threat modelling surface that security teams can map to existing controls — and identify where gaps exist.
Attack Surface Analysis
The most immediate and measurable concern is content injection. Agents routinely ingest webpages, documents, emails, and tool outputs. If an agent cannot reliably distinguish between data and instructions embedded within that data, an attacker who controls any ingested source controls the agent. NIST evaluation data cited in the research shows a 57% average success rate for malicious instruction injection across five agent hijacking task types — a figure that should end any debate about whether this is a production risk.
Cognitive state poisoning represents a more sophisticated and persistent threat. By corrupting an agent’s working memory or context window early in a multi-step task chain, an attacker can influence downstream decisions without maintaining a persistent presence in any single input. This is particularly dangerous in long-horizon agentic workflows where humans review only terminal outputs.
Semantic manipulation targets agent reasoning rather than instruction parsing — crafting content that appears legitimate to both human reviewers and surface-level filters but nudges the model toward attacker-preferred conclusions through framing, word choice, or false contextual signals.
The human-in-the-loop bypass category is described as more theoretical today but is structurally important: as organisations add human review gates to agent workflows, adversaries will increasingly structure attack payloads to avoid triggering those checkpoints — for example, by staging exfiltration across multiple low-confidence agent actions rather than a single high-confidence one.
Framework Mapping
| Trap Class | MITRE ATLAS | OWASP LLM |
|---|---|---|
| Content Injection | AML.T0051 – LLM Prompt Injection | LLM01 – Prompt Injection |
| Semantic Manipulation | AML.T0043 – Craft Adversarial Data | LLM09 – Overreliance |
| Cognitive State Poisoning | AML.T0031 – Erode ML Model Integrity | LLM02 – Insecure Output Handling |
| Behavioural Control | AML.T0047 – ML-Enabled Product or Service | LLM08 – Excessive Agency |
| Data Exfiltration | AML.T0057 – LLM Data Leakage | LLM06 – Sensitive Information Disclosure |
Threat Scenarios
Scenario 1 — CRM Exfiltration via Support Ticket: An attacker submits a support ticket containing invisible-text prompt injection. An AI agent processing the ticket follows injected instructions to query the CRM for customer PII and forward results to an attacker-controlled webhook. The human support queue sees only a resolved ticket.
Scenario 2 — Poisoned Wiki Page: An internal knowledge base article — potentially modified by a compromised insider or via a supply chain attack on the wiki platform — contains semantically crafted content that causes an AI coding agent to introduce a vulnerable dependency or disable a security check during automated code review.
Scenario 3 — Multi-Step Context Poisoning: In an agentic research workflow, an attacker publishes a malicious webpage that, when browsed by an agent in step 2 of a 10-step task, plants false context that influences the agent’s final report or action in step 10 — well past any sandboxed parsing checkpoint applied to initial inputs.
Defender Checklist
- Map all agent ingestion surfaces: document every external data source each agent can read; treat all as adversarial until proven otherwise
- Enforce instruction-data separation: evaluate your orchestration framework’s ability to tag and isolate data-plane content from instruction-plane processing
- Apply content sanitisation pipelines: strip metadata, hidden text, and image-embedded content before it reaches the agent context window
- Implement agent action logging with anomaly baselines: flag unexpected outbound connections, privilege escalations, or data access patterns deviating from task norms
- Red-team agent workflows against the six trap classes: use the DeepMind taxonomy as a test plan, not just a reading reference
- Limit agent blast radius: enforce least-privilege tool access; an agent that can only read CRM records in scope for the current ticket cannot exfiltrate the full database
- Do not rely solely on human-in-the-loop as a safety net: design review gates that surface intermediate agent reasoning, not just terminal outputs
References
- Etay Maor, “When Information Becomes the Attack Surface – Understanding AI Agent Traps”, SecurityWeek, June 24 2026: https://www.securityweek.com/when-information-becomes-the-attack-surface-understanding-ai-agent-traps/