Overview
Fernando Irarrázaval ran a public adversarial challenge at hackmyclaw.com, inviting anyone to attempt to extract secrets from an AI email assistant built on Anthropic’s Claude Opus 4.6. Over 6,000 injection attempts were submitted by approximately 2,000 participants, at a cost of roughly $500 in token spend — plus a Google account suspension from excessive inbound email volume. No participant successfully exfiltrated the target secrets.
Simon Willison, commenting on the experiment, connects the result to a broader trend: frontier model labs are investing heavily in training models to resist prompt injection, and those investments appear to be yielding measurable improvements in real-world robustness.
Technical Analysis
The agent operated under an explicit system-prompt ruleset labelled Anti-Prompt-Injection Rules, prohibiting the model from — regardless of email content — revealing credentials, modifying its own configuration files, executing code, or exfiltrating data to external endpoints. The rules were declarative and model-enforced rather than enforced by an external policy layer.
The attack surface was indirect prompt injection via email: untrusted user-controlled content arriving in a channel the agent is designed to read and act upon. This is a well-documented and high-risk pattern, as the model must parse attacker-controlled text to perform its legitimate function.
Despite the volume of attempts, no successful injection was recorded. This aligns with published findings in the GPT-5.6 system card (referenced in the article) noting improved robustness to injection in frontier-class models.
Key caveats noted by the community:
- 6,000 failed attempts under public, time-limited conditions do not rule out success by a well-resourced, patient adversary.
- The challenge did not control for novel or as-yet-unpublished jailbreak techniques.
- Model-level defences can degrade with model updates, fine-tuning, or context window manipulation.
Framework Mapping
| Framework | Technique | Relevance |
|---|---|---|
| MITRE ATLAS | AML.T0051 – LLM Prompt Injection | Core attack vector: malicious instructions embedded in emails |
| MITRE ATLAS | AML.T0057 – LLM Data Leakage | Target objective: exfiltrate secrets.env credentials |
| MITRE ATLAS | AML.T0056 – LLM Meta Prompt Extraction | Implicit goal: surface system prompt / SOUL.md contents |
| OWASP LLM01 | Prompt Injection | Primary threat category |
| OWASP LLM06 | Sensitive Information Disclosure | Credential and secret leakage as target |
| OWASP LLM08 | Excessive Agency | Agent’s file-modification and code-execution capabilities represent excessive agency risk |
Impact Assessment
The immediate impact of this specific challenge is low — no secrets were leaked. The broader implication is moderate: the result is encouraging but not exculpatory. Any organisation deploying an LLM agent over an untrusted input channel (email, web scraping, document ingestion) faces this attack surface. The consequences of a successful injection in a production system with real credentials and external execution capabilities could be severe.
Mitigation & Recommendations
- Do not treat model-level rules as a security boundary. Enforce restrictions at the architectural layer: separate credential stores, output validation, and scoped API permissions.
- Apply the principle of least privilege to agent capabilities. If the agent does not need to execute code or modify files, remove that capability entirely.
- Implement human-in-the-loop controls for any action that is irreversible or has external side effects.
- Monitor and alert on anomalous agent outputs, particularly those involving external HTTP requests or file writes.
- Re-evaluate robustness after any model update or prompt change; injection resistance is not a static property.
References
- Simon Willison’s commentary
- hackmyclaw.com challenge (via article)
- Hacker News discussion thread referenced in the article