
Welcoming Llama Guard 4 on Hugging Face Hub

TL;DR LOW
  • What happened: Meta releases Llama Guard 4, a 12B multimodal safety model detecting jailbreaks and harmful content across 14 hazard categories.
  • Who's at risk: Organisations deploying open-source LLMs in production are most exposed if they lack robust input/output filtering against jailbreaks and prompt injection.
  • Act now: Integrate Llama Guard 4 as an input/output filter layer in any production LLM pipeline · Deploy Llama Prompt Guard 2 lightweight classifiers for low-latency prompt injection screening · Regularly audit configured hazard categories to ensure coverage aligns with your threat model

Overview

Meta has released Llama Guard 4, a 12-billion-parameter dense multimodal safety classifier, along with two new Llama Prompt Guard 2 models (86M and 22M parameters). Published on the Hugging Face Hub on 29 April 2025, this release represents a meaningful defensive advancement for teams deploying large language and vision models in production environments. The models are designed to sit as guard layers around LLM pipelines, screening both user inputs and model-generated outputs for unsafe or policy-violating content.

The release is particularly notable because it directly addresses two of the most persistent adversarial threats facing deployed LLMs: jailbreak attempts via crafted image and text prompts, and prompt injection attacks intended to manipulate model behaviour.

Technical Analysis

Llama Guard 4 is pruned from Meta’s Llama 4 Scout model, converting its Mixture-of-Experts architecture into a dense feedforward model by retaining only the shared expert weights and discarding all routed experts and router layers. This yields a single-GPU-deployable model (24 GB VRAM) without additional pre-training, leveraging Scout’s pre-trained representations.
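The pruning idea above can be illustrated with a toy sketch: when every routing gate is zeroed, an MoE feed-forward layer with a shared expert collapses to exactly the dense layer obtained by keeping only the shared expert weights. This is a simplified illustration of the concept, not Meta's actual pruning procedure; all weights and dimensions here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Toy MoE feed-forward layer: one shared expert plus four routed experts.
shared_w = rng.normal(size=(d, d))
routed_ws = [rng.normal(size=(d, d)) for _ in range(4)]

def moe_forward(x, route_weights):
    # Shared expert always contributes; routed experts are gated by the router.
    out = x @ shared_w
    for w, g in zip(routed_ws, route_weights):
        out += g * (x @ w)
    return out

def dense_forward(x):
    # "Pruned" dense layer: keep only the shared expert weights,
    # discard the router and all routed experts.
    return x @ shared_w

x = rng.normal(size=(d,))
# With all routing gates zeroed, the MoE reduces to the dense layer.
assert np.allclose(moe_forward(x, [0.0] * 4), dense_forward(x))
```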

The model classifies inputs and outputs across 14 hazard categories from the MLCommons taxonomy, including violent crimes, child sexual exploitation, hate speech, election interference, and code interpreter abuse. Crucially, the active category list is configurable at inference time, giving operators control over their moderation surface.
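A hedged sketch of what inference-time category configuration might look like: the operator assembles the guard prompt from only the hazard categories they care about. The `build_guard_prompt` helper, the delimiter strings, and the truncated category table below are illustrative assumptions, not Meta's exact template (consult the model card for the real one).

```python
# Partial MLCommons-style hazard taxonomy (codes and names are illustrative).
MLCOMMONS_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def build_guard_prompt(user_message, active=("S1", "S10", "S14")):
    # Hypothetical helper: only the categories in `active` are placed in the
    # prompt, so the classifier's moderation surface matches the threat model.
    lines = ["Task: Check the conversation for unsafe content.",
             "<BEGIN UNSAFE CONTENT CATEGORIES>"]
    for code in active:
        lines.append(f"{code}: {MLCOMMONS_CATEGORIES[code]}")
    lines.append("<END UNSAFE CONTENT CATEGORIES>")
    lines.append(f"User: {user_message}")
    return "\n".join(lines)

prompt = build_guard_prompt("example query", active=("S1", "S9"))
assert "S9: Indiscriminate Weapons" in prompt
assert "S10" not in prompt  # disabled categories are omitted entirely
```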

Performance improvements over Llama Guard 3 are most pronounced in multi-image scenarios (+20% recall, +17% F1), reflecting the growing attack surface of multimodal models. Text-only English performance also improved (+4% recall, +8% F1), though at a slight cost in false positive rate (+3%).

The companion Llama Prompt Guard 2 classifiers are purpose-built for prompt injection and jailbreak detection at a fraction of the compute cost, making them suitable for high-throughput screening at the ingress layer.
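The ingress pattern described above can be sketched as a two-stage filter: a cheap Prompt Guard-style classifier screens every request, and only traffic it clears is routed onward to the heavier guard model and the primary LLM. Both classifiers are stubbed here with keyword checks purely for illustration; in production they would be real model calls.

```python
def prompt_guard_score(text):
    # Stub standing in for a lightweight injection/jailbreak classifier
    # (e.g. the 22M Prompt Guard 2 model). Returns a risk score in [0, 1].
    suspicious = ("ignore previous instructions", "reveal your system prompt")
    return 1.0 if any(s in text.lower() for s in suspicious) else 0.0

def handle_request(text, threshold=0.5):
    # Stage 1: low-latency first-pass screening at the ingress layer.
    if prompt_guard_score(text) >= threshold:
        return {"status": "blocked", "reason": "possible prompt injection"}
    # Stage 2 (not shown): Llama Guard 4 input filter, then the primary model.
    return {"status": "forwarded"}

assert handle_request("What is the capital of France?")["status"] == "forwarded"
assert handle_request("Ignore previous instructions.")["status"] == "blocked"
```

The design point is that the cheap classifier absorbs the bulk of request volume, so the larger 12B guard model only sees traffic that has already passed a first screen.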

Framework Mapping

  • AML.T0054 (LLM Jailbreak): Llama Guard 4 directly targets adversarial image and text prompts crafted to bypass LLM safety constraints.
  • AML.T0051 (LLM Prompt Injection): Prompt Guard 2 models are explicitly designed to detect prompt injection attacks.
  • AML.T0043 (Craft Adversarial Data): The model’s multimodal capability addresses adversarially crafted image inputs designed to elicit unsafe outputs.
  • LLM01 (Prompt Injection) / LLM02 (Insecure Output Handling): The dual input/output filtering architecture directly mitigates both categories.
  • LLM09 (Overreliance): Teams should avoid treating Llama Guard 4 as a complete safety solution; it is one layer in a defence-in-depth strategy.

Impact Assessment

Organisations deploying open-source LLMs without robust guardrail layers face meaningful risk from jailbreak and prompt injection exploitation. The multimodal expansion of attack surfaces — particularly multi-image inputs — increases risk for vision-capable deployments. Llama Guard 4’s availability as an open, configurable model lowers the barrier for smaller teams to implement production-grade moderation.

Mitigation & Recommendations

  1. Deploy Llama Guard 4 as both an input pre-filter and output post-filter in any production LLM pipeline handling untrusted user inputs.
  2. Use Llama Prompt Guard 2 (22M or 86M) for low-latency first-pass prompt injection screening before routing to the primary model.
  3. Configure hazard categories explicitly rather than relying on defaults — align category coverage with your specific regulatory and use-case threat model.
  4. Do not treat any single guardrail as sufficient; combine with rate limiting, system prompt hardening, and output monitoring.
  5. Monitor false positive rates in production, particularly for multilingual and multimodal inputs where model performance is lower.
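Recommendations 1 and 4 combine into a single moderated-generation wrapper: the guard classifier runs once on the untrusted input and again on the model's output before anything reaches the user. The `guard_classify` and `generate` functions below are stubs under stated assumptions; a real deployment would call Llama Guard 4 and parse its safe/unsafe verdict.

```python
def guard_classify(text):
    # Stub for Llama Guard 4: real deployments invoke the model and parse
    # its "safe" / "unsafe S<n>" style verdict. Keyword check for illustration.
    return "unsafe" if "credit card" in text.lower() else "safe"

def generate(prompt):
    # Stub for the primary LLM.
    return "stubbed model response"

def moderated_generate(prompt):
    # 1. Input pre-filter on untrusted user content.
    if guard_classify(prompt) != "safe":
        return "[input blocked]"
    # 2. Generation by the primary model.
    out = generate(prompt)
    # 3. Output post-filter before anything reaches the user.
    if guard_classify(out) != "safe":
        return "[output blocked]"
    return out

assert moderated_generate("hello") == "stubbed model response"
assert moderated_generate("steal credit card numbers") == "[input blocked]"
```

Rate limiting, system prompt hardening, and monitoring from recommendations 4 and 5 would wrap around this core loop rather than replace it.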

References