Welcoming Llama Guard 4 on Hugging Face Hub

Overview

Meta has released Llama Guard 4, a 12-billion-parameter dense multimodal safety classifier, along with two new Llama Prompt Guard 2 models (86M and 22M parameters). Published on the Hugging Face Hub on 29 April 2025, this release represents a meaningful defensive advancement for teams deploying large language and vision models in production environments. The models are designed to sit as guard layers around LLM pipelines, screening both user inputs and model-generated outputs for unsafe or policy-violating content.

The release is particularly notable because it directly addresses two of the most persistent adversarial threats facing deployed LLMs: jailbreak attempts via crafted image and text prompts, and prompt injection attacks intended to manipulate model behaviour.

Technical Analysis

Llama Guard 4 is pruned from Meta’s Llama 4 Scout model, converting its Mixture-of-Experts architecture into a dense feedforward model by retaining only the shared expert weights and discarding all routed experts and router layers. This yields a single-GPU-deployable model (24 GB VRAM) without additional pre-training, leveraging Scout’s pre-trained representations.

The model classifies inputs and outputs across 14 hazard categories from the MLCommons taxonomy, including violent crimes, child sexual exploitation, hate speech, elections interference, and code interpreter abuse. Crucially, the active category list is configurable at inference time, giving operators control over their moderation surface.

Performance improvements over Llama Guard 3 are most pronounced in multi-image scenarios (+20% recall, +17% F1), reflecting the growing attack surface of multimodal models. Text-only English performance also improved (+4% recall, +8% F1), though at a slight cost in false positive rate (+3%).

The companion Llama Prompt Guard 2 classifiers are purpose-built for prompt injection and jailbreak detection at a fraction of the compute cost, making them suitable for high-throughput screening at the ingress layer.

Framework Mapping

AML.T0054 (LLM Jailbreak): Llama Guard 4 directly targets adversarial image and text prompts crafted to bypass LLM safety constraints.
AML.T0051 (LLM Prompt Injection): Prompt Guard 2 models are explicitly designed to detect prompt injection attacks.
AML.T0043 (Craft Adversarial Data): The model’s multimodal capability addresses adversarially crafted image inputs designed to elicit unsafe outputs.
LLM01 (Prompt Injection) / LLM02 (Insecure Output Handling): The dual input/output filtering architecture directly mitigates both categories.
LLM09 (Overreliance): Teams should avoid treating Llama Guard 4 as a complete safety solution; it is one layer in a defence-in-depth strategy.

Impact Assessment

Organisations deploying open-source LLMs without robust guardrail layers face meaningful risk from jailbreak and prompt injection exploitation. The multimodal expansion of attack surfaces — particularly multi-image inputs — increases risk for vision-capable deployments. Llama Guard 4’s availability as an open, configurable model lowers the barrier for smaller teams to implement production-grade moderation.

Mitigation & Recommendations

Deploy Llama Guard 4 as both an input pre-filter and output post-filter in any production LLM pipeline handling untrusted user inputs.
Use Llama Prompt Guard 2 (22M or 86M) for low-latency first-pass prompt injection screening before routing to the primary model.
Configure hazard categories explicitly rather than relying on defaults — align category coverage with your specific regulatory and use-case threat model.
Do not treat any single guardrail as sufficient; combine with rate limiting, system prompt hardening, and output monitoring.
Monitor false positive rates in production, particularly for multilingual and multimodal inputs where model performance is lower.