6,000 Prompt Injection Attempts Fail Against Frontier Model — But Risks Remain

Overview

Fernando Irarrázaval ran a public adversarial challenge at hackmyclaw.com, inviting anyone to attempt to extract secrets from an AI email assistant built on Anthropic’s Claude Opus 4.6. Over 6,000 injection attempts were submitted by approximately 2,000 participants, at a cost of roughly $500 in token spend — plus a Google account suspension from excessive inbound email volume. No participant successfully exfiltrated the target secrets.

Simon Willison, commenting on the experiment, connects the result to a broader trend: frontier model labs are investing heavily in training models to resist prompt injection, and those investments appear to be yielding measurable improvements in real-world robustness.

Technical Analysis

The agent operated under an explicit system-prompt ruleset labelled Anti-Prompt-Injection Rules, prohibiting the model from — regardless of email content — revealing credentials, modifying its own configuration files, executing code, or exfiltrating data to external endpoints. The rules were declarative and model-enforced rather than enforced by an external policy layer.

The attack surface was indirect prompt injection via email: untrusted user-controlled content arriving in a channel the agent is designed to read and act upon. This is a well-documented and high-risk pattern, as the model must parse attacker-controlled text to perform its legitimate function.

Despite the volume of attempts, no successful injection was recorded. This aligns with published findings in the GPT-5.6 system card (referenced in the article) noting improved robustness to injection in frontier-class models.

Key caveats noted by the community:

6,000 failed attempts under public, time-limited conditions do not rule out success by a well-resourced, patient adversary.
The challenge did not control for novel or as-yet-unpublished jailbreak techniques.
Model-level defences can degrade with model updates, fine-tuning, or context window manipulation.

Framework Mapping

Framework	Technique	Relevance
MITRE ATLAS	AML.T0051 – LLM Prompt Injection	Core attack vector: malicious instructions embedded in emails
MITRE ATLAS	AML.T0057 – LLM Data Leakage	Target objective: exfiltrate secrets.env credentials
MITRE ATLAS	AML.T0056 – LLM Meta Prompt Extraction	Implicit goal: surface system prompt / SOUL.md contents
OWASP LLM01	Prompt Injection	Primary threat category
OWASP LLM06	Sensitive Information Disclosure	Credential and secret leakage as target
OWASP LLM08	Excessive Agency	Agent’s file-modification and code-execution capabilities represent excessive agency risk

Impact Assessment

The immediate impact of this specific challenge is low — no secrets were leaked. The broader implication is moderate: the result is encouraging but not exculpatory. Any organisation deploying an LLM agent over an untrusted input channel (email, web scraping, document ingestion) faces this attack surface. The consequences of a successful injection in a production system with real credentials and external execution capabilities could be severe.

Mitigation & Recommendations

Do not treat model-level rules as a security boundary. Enforce restrictions at the architectural layer: separate credential stores, output validation, and scoped API permissions.
Apply the principle of least privilege to agent capabilities. If the agent does not need to execute code or modify files, remove that capability entirely.
Implement human-in-the-loop controls for any action that is irreversible or has external side effects.
Monitor and alert on anomalous agent outputs, particularly those involving external HTTP requests or file writes.
Re-evaluate robustness after any model update or prompt change; injection resistance is not a static property.

References

Simon Willison’s commentary
hackmyclaw.com challenge (via article)
Hacker News discussion thread referenced in the article