
Welcoming Llama Guard 4 on Hugging Face Hub

TL;DR LOW
  • What happened: Meta releases Llama Guard 4, a 12B multimodal safety model detecting jailbreaks and harmful content across 14 hazard categories.
  • Who's at risk: Organisations deploying open-source LLMs in production are most exposed if they lack robust input/output filtering against jailbreaks and prompt injection.
  • Act now: Integrate Llama Guard 4 as an input/output filter layer in any production LLM pipeline · Deploy Llama Prompt Guard 2 lightweight classifiers for low-latency prompt injection screening · Regularly audit configured hazard categories to ensure coverage aligns with your threat model
Overview

Meta has released Llama Guard 4, a 12-billion-parameter dense multimodal safety classifier, along with two new Llama Prompt Guard 2 models (86M and 22M parameters). Published on the Hugging Face Hub on 29 April 2025, the release gives teams that run large language and vision models in production an openly available set of guard models. The models are designed to sit as guard layers around LLM pipelines, screening both user inputs and model-generated outputs for unsafe or policy-violating content.

The release is particularly notable because it directly addresses two of the most persistent adversarial threats facing deployed LLMs: jailbreak attempts via crafted image and text prompts, and prompt injection attacks intended to manipulate model behaviour.

Technical Analysis

Llama Guard 4 is pruned from Meta’s Llama 4 Scout model, converting its Mixture-of-Experts architecture into a dense feedforward model by retaining only the shared expert weights and discarding all routed experts and router layers. This yields a single-GPU-deployable model (24 GB VRAM) without additional pre-training, leveraging Scout’s pre-trained representations.
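The pruning step described above can be pictured as a filter over checkpoint parameter names. The following is a minimal sketch only: the key names (`router`, `experts`, `shared_expert`) and the function `prune_moe_to_dense` are hypothetical and chosen for illustration, not taken from Meta's actual checkpoint layout or tooling.

```python
def prune_moe_to_dense(state_dict):
    """Keep shared-expert weights, drop routed experts and routers.

    Assumes a hypothetical naming scheme in which each MoE block stores
    its parameters under '.router.', '.experts.<i>.' and '.shared_expert.'.
    """
    dense = {}
    for name, tensor in state_dict.items():
        # Routed experts and router layers are discarded entirely.
        if ".router." in name or ".experts." in name:
            continue
        # The shared expert becomes the block's only feedforward path,
        # so its prefix is collapsed into the parent module name.
        dense[name.replace(".shared_expert.", ".")] = tensor
    return dense


# Toy checkpoint with placeholder strings standing in for tensors.
moe_checkpoint = {
    "layers.0.attn.q_proj.weight": "A",
    "layers.0.feed_forward.router.weight": "R",
    "layers.0.feed_forward.experts.0.up_proj.weight": "E0",
    "layers.0.feed_forward.shared_expert.up_proj.weight": "S",
}
dense_checkpoint = prune_moe_to_dense(moe_checkpoint)
```

Attention weights pass through unchanged, which is consistent with the article's point that the dense model reuses Scout's pre-trained representations rather than being pre-trained again.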

The model classifies inputs and outputs across 14 hazard categories: the 13 categories of the MLCommons hazard taxonomy plus an additional category for code interpreter abuse. Examples include violent crimes, child sexual exploitation, hate, and elections. Crucially, the active category list is configurable at inference time, so operators can narrow or widen what the model screens for.
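To make the configurable category list concrete, here is a small sketch of how an operator might select active categories and read the classifier's verdict. It assumes the `S1` to `S14` codes and the `safe` / `unsafe` plus comma-separated-codes output format used by earlier Llama Guard releases; the helpers `active_categories` and `parse_verdict` are hypothetical names, not part of any Meta or Hugging Face API.

```python
from dataclasses import dataclass

# Category codes as listed in Meta's Llama Guard model cards (assumed here).
HAZARD_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}


def active_categories(excluded=()):
    """Return the categories that remain after excluding the given codes."""
    unknown = set(excluded) - set(HAZARD_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown category codes: {sorted(unknown)}")
    return {k: v for k, v in HAZARD_CATEGORIES.items() if k not in excluded}


@dataclass(frozen=True)
class Verdict:
    safe: bool
    categories: tuple  # violated category codes, empty when safe


def parse_verdict(raw):
    """Parse 'safe' or 'unsafe' followed by a line of comma-separated codes."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty classifier output")
    label = lines[0].lower()
    if label == "safe":
        return Verdict(True, ())
    if label != "unsafe":
        raise ValueError(f"unexpected label: {lines[0]!r}")
    codes = lines[1].split(",") if len(lines) > 1 else []
    return Verdict(False, tuple(c.strip() for c in codes if c.strip()))


# Example: an operator whose use case does not need election screening.
selected = active_categories(excluded=("S13",))
verdict = parse_verdict("unsafe\nS1,S10")
```

Failing loudly on an unknown code or an unexpected label is a deliberate choice: a guard layer that silently treats malformed output as safe would defeat its purpose.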

Performance improvements over Llama Guard 3 are most pronounced in multi-image scenarios (+20% recall, +17% F1), reflecting the growing attack surface of multimodal models. Text-only English performance also improved (+4% recall, +8% F1), though at a slight cost in false positive rate (+3%).
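For readers less familiar with the metrics quoted above, the sketch below shows how recall, false positive rate, and F1 are derived from a confusion matrix. The counts are invented for illustration and are not Meta's evaluation data; `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute recall, precision, false positive rate and F1 from counts.

    tp: unsafe items correctly flagged    fp: safe items wrongly flagged
    tn: safe items correctly passed       fn: unsafe items missed
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "fpr": fpr, "f1": f1}


# Illustrative counts only: 100 unsafe and 100 safe samples.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

Because F1 combines precision and recall, a model can raise both recall and F1 while its false positive rate also moves, which is why all three figures are worth tracking together.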

The companion Llama Prompt Guard 2 classifiers are purpose-built for prompt injection and jailbreak detection at a fraction of the compute cost, making them suitable for high-throughput screening at the ingress layer.

Framework Mapping

  • AML.T0054 (LLM Jailbreak): Llama Guard 4 directly targets adversarial image and text prompts crafted to bypass LLM safety constraints.
  • AML.T0051 (LLM Prompt Injection): Prompt Guard 2 models are explicitly designed to detect prompt injection attacks.
  • AML.T0043 (Craft Adversarial Data): The model’s multimodal capability addresses adversarially crafted image inputs designed to elicit unsafe outputs.
  • LLM01 (Prompt Injection) / LLM02 (Insecure Output Handling): The dual input/output filtering architecture directly mitigates both categories.
  • LLM09 (Overreliance): Teams should avoid treating Llama Guard 4 as a complete safety solution; it is one layer in a defence-in-depth strategy.

Impact Assessment

Organisations that deploy open-source LLMs without guardrail layers remain exposed to jailbreak and prompt injection exploitation. Multimodal inputs, particularly multi-image inputs, widen the attack surface for vision-capable deployments. Because Llama Guard 4 is open and configurable, it lowers the barrier for smaller teams to implement production-grade moderation.

Mitigation & Recommendations

  1. Deploy Llama Guard 4 as both an input pre-filter and output post-filter in any production LLM pipeline handling untrusted user inputs.
  2. Use Llama Prompt Guard 2 (22M or 86M) for low-latency first-pass prompt injection screening before routing to the primary model.
  3. Configure hazard categories explicitly rather than relying on defaults — align category coverage with your specific regulatory and use-case threat model.
  4. Do not treat any single guardrail as sufficient; combine with rate limiting, system prompt hardening, and output monitoring.
  5. Monitor false positive rates in production, particularly for multilingual and multimodal inputs where model performance is lower.
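
The layering in recommendations 1, 2 and 4 could be wired together roughly as follows. This is a sketch under stated assumptions: the three callables stand in for a Prompt Guard 2 classifier, a Llama Guard 4 classifier, and the primary model, and every name here (`guarded_generate`, `GuardedResult`, the toy stand-ins, the 0.5 threshold) is hypothetical rather than part of any published API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GuardedResult:
    allowed: bool
    blocked_at: Optional[str]  # "prompt_guard", "input_guard", "output_guard" or None
    text: Optional[str]


def guarded_generate(
    prompt: str,
    injection_score: Callable[[str], float],
    is_safe: Callable[[str, Optional[str]], bool],
    generate: Callable[[str], str],
    injection_threshold: float = 0.5,
) -> GuardedResult:
    # 1. Cheap first-pass screen for prompt injection and jailbreaks.
    if injection_score(prompt) >= injection_threshold:
        return GuardedResult(False, "prompt_guard", None)
    # 2. Input pre-filter with the larger safety classifier.
    if not is_safe(prompt, None):
        return GuardedResult(False, "input_guard", None)
    # 3. Only now call the primary model.
    answer = generate(prompt)
    # 4. Output post-filter on the prompt/response pair.
    if not is_safe(prompt, answer):
        return GuardedResult(False, "output_guard", None)
    return GuardedResult(True, None, answer)


# Toy stand-ins so the sketch runs without any model; replace with real calls.
def toy_injection_score(prompt: str) -> float:
    return 0.9 if "ignore previous instructions" in prompt.lower() else 0.1


def toy_is_safe(prompt: str, answer: Optional[str]) -> bool:
    text = answer if answer is not None else prompt
    return "forbidden" not in text.lower()


def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"


result = guarded_generate("hello", toy_injection_score, toy_is_safe, toy_generate)
```

Recording `blocked_at` for each request gives a simple per-stage signal that can feed the false positive monitoring suggested in recommendation 5.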
