Intermediate

Red-Team & Safety Tests

Red teaming LLMs means deliberately attacking your system with adversarial prompts to find safety and reliability failures before real users do. In 2025, 35% of real-world AI security incidents resulted from simple prompt attacks — not sophisticated exploits. The 2025 OWASP LLM Top 10 introduced five new vulnerability categories. The EU AI Act mandates adversarial testing for high-risk AI systems. Red teaming is no longer optional — it is a deployment prerequisite.

What Red Teaming Finds

Security vulnerabilities

Prompt injection — user input overrides system prompt instructions
Jailbreaks — circumventing content filters via roleplay, encoding, or logic traps
System prompt extraction — model reveals its instructions to the user
Data exfiltration — model leaks training data or retrieved context it should not share
Excessive agency — model takes actions beyond what the user authorised

Safety failures

Harmful content generation — weapons, violence, CSAM, extremism
Bias and discrimination — differential treatment based on protected characteristics
Misinformation — confident factual errors on sensitive topics
Unbounded consumption — model triggers runaway costs or compute
Privacy violations — revealing PII about real individuals

Attack Categories (2025)

Attack type	How it works	2025 success rate
Roleplay injection	Ask model to play a character without restrictions; character then provides harmful content	89.6% ASR
Logic trap	Construct conditional argument that leads model to conclude harmful output is justified	81.4% ASR
Encoding tricks	Base64, ROT13, Unicode substitution — evade keyword-based filters	76.2% ASR
Prompt injection (indirect)	Malicious content in retrieved documents or tool results overrides instructions	High (varies by system)
System prompt extraction	Ask model to repeat, translate, or summarise its system prompt	Medium
Many-shot jailbreaking	Flood context with examples of model "complying"; model follows the pattern	Increasing with larger contexts

ASR = Attack Success Rate. Source: 2025 prompt injection and jailbreak vulnerability research.

OWASP LLM Top 10 (2025)

The OWASP LLM Top 10 is the standard vulnerability taxonomy for LLM applications. 2025 added five new categories not present in the original 2023 list:

Original categories (2023)

LLM01: Prompt Injection
LLM02: Insecure Output Handling
LLM03: Training Data Poisoning
LLM04: Model Denial of Service
LLM05: Supply Chain Vulnerabilities

New in 2025

Excessive Agency: model takes actions beyond authorised scope
System Prompt Leakage: model reveals confidential instructions
Vector / Embedding Weaknesses: poisoned embeddings in vector stores
Misinformation: confident generation of false information
Unbounded Consumption: runaway token/compute costs

Red Teaming Tools

DeepTeam (Confident AI)

Open-source Python framework with 40+ vulnerability classes and 10+ adversarial attack strategies. Results scored against OWASP LLM Top 10 and NIST AI RMF. Integrates with DeepEval for combined safety + quality testing.

Automated attack generation: roleplay, jailbreaks, encoding attacks, prompt injection
Configurable attack intensity (number of variations per attack type)
Report maps findings to OWASP LLM Top 10 categories

Garak

Adversarial testing toolkit with 100+ attack modules designed for security-first workflows. Automates vulnerability scanning across a wide attack surface. Open-source, command-line driven.

Attack modules: encoding, roleplay, logic, data extraction, many-shot
Produces a structured vulnerability report per module
Supports most LLM providers via API

promptfoo (red team mode)

promptfoo's built-in --red-team flag runs a standard adversarial test set alongside regular evaluation, producing a combined safety + quality report in one CI run.

Manual vs Automated Red Teaming

Automated (always)

Broad coverage — runs 100s of attack variants systematically
Repeatable — same attacks on every PR and deployment
Fast — complete in minutes as part of CI pipeline
Catches regressions — alerts when a previously-safe behaviour breaks

Manual (before major releases)

Nuanced — human testers find subtle failures automated tools miss
Creative — explores novel attack vectors specific to your domain
Business context — testers understand what is actually harmful for your use case
Required for EU AI Act high-risk systems before deployment

Building a Safety Testing Programme

Minimum viable red team programme:

Define your threat model: what harms are possible from your specific system? (a coding assistant has different risks than a medical chatbot)
Map to OWASP LLM Top 10: identify which categories apply to your system
Run automated attacks in CI: DeepTeam or Garak on every deployment
Set safety gates: block deployment if any critical vulnerability category fails
Manual red team before launch: domain experts test novel attacks your tools cannot generate
Document findings: maintain a vulnerability log with status (found, mitigated, accepted risk)
Re-test on model updates: provider model changes reset safety guarantees — re-run full programme

Regulatory Context

EU AI Act requirements (2025):

Mandatory adversarial testing for high-risk AI systems: critical infrastructure, education, employment, law enforcement, border control
Red team results must be documented and available for regulatory audit
Safety testing must cover the OWASP LLM Top 10 at minimum
Re-testing required when the model or system undergoes significant change

Checklist: Do You Understand This?

What percentage of 2025 AI security incidents resulted from simple prompt attacks, and what does this imply about red teaming priority?
What are the five new OWASP LLM Top 10 categories added in 2025?
Why did roleplay injection achieve the highest attack success rate (89.6%) in 2025 research?
What is the difference between automated and manual red teaming, and when is each appropriate?
What seven steps constitute a minimum viable red team programme?
Under the EU AI Act, which types of systems require mandatory adversarial testing?