🧠 All Things AI
Intermediate

Red-Team & Safety Tests

Red teaming LLMs means deliberately attacking your system with adversarial prompts to find safety and reliability failures before real users do. In 2025, 35% of real-world AI security incidents resulted from simple prompt attacks — not sophisticated exploits. The 2025 OWASP LLM Top 10 introduced five new vulnerability categories. The EU AI Act mandates adversarial testing for high-risk AI systems. Red teaming is no longer optional — it is a deployment prerequisite.

What Red Teaming Finds

Security vulnerabilities

  • Prompt injection — user input overrides system prompt instructions
  • Jailbreaks — circumventing content filters via roleplay, encoding, or logic traps
  • System prompt extraction — model reveals its instructions to the user
  • Data exfiltration — model leaks training data or retrieved context it should not share
  • Excessive agency — model takes actions beyond what the user authorised

Safety failures

  • Harmful content generation — weapons, violence, CSAM, extremism
  • Bias and discrimination — differential treatment based on protected characteristics
  • Misinformation — confident factual errors on sensitive topics
  • Unbounded consumption — model triggers runaway costs or compute
  • Privacy violations — revealing PII about real individuals

Attack Categories (2025)

Attack typeHow it works2025 success rate
Roleplay injectionAsk model to play a character without restrictions; character then provides harmful content89.6% ASR
Logic trapConstruct conditional argument that leads model to conclude harmful output is justified81.4% ASR
Encoding tricksBase64, ROT13, Unicode substitution — evade keyword-based filters76.2% ASR
Prompt injection (indirect)Malicious content in retrieved documents or tool results overrides instructionsHigh (varies by system)
System prompt extractionAsk model to repeat, translate, or summarise its system promptMedium
Many-shot jailbreakingFlood context with examples of model "complying"; model follows the patternIncreasing with larger contexts

ASR = Attack Success Rate. Source: 2025 prompt injection and jailbreak vulnerability research.

OWASP LLM Top 10 (2025)

The OWASP LLM Top 10 is the standard vulnerability taxonomy for LLM applications. 2025 added five new categories not present in the original 2023 list:

Original categories (2023)

  • LLM01: Prompt Injection
  • LLM02: Insecure Output Handling
  • LLM03: Training Data Poisoning
  • LLM04: Model Denial of Service
  • LLM05: Supply Chain Vulnerabilities

New in 2025

  • Excessive Agency: model takes actions beyond authorised scope
  • System Prompt Leakage: model reveals confidential instructions
  • Vector / Embedding Weaknesses: poisoned embeddings in vector stores
  • Misinformation: confident generation of false information
  • Unbounded Consumption: runaway token/compute costs

Red Teaming Tools

DeepTeam (Confident AI)

Open-source Python framework with 40+ vulnerability classes and 10+ adversarial attack strategies. Results scored against OWASP LLM Top 10 and NIST AI RMF. Integrates with DeepEval for combined safety + quality testing.

  • Automated attack generation: roleplay, jailbreaks, encoding attacks, prompt injection
  • Configurable attack intensity (number of variations per attack type)
  • Report maps findings to OWASP LLM Top 10 categories

Garak

Adversarial testing toolkit with 100+ attack modules designed for security-first workflows. Automates vulnerability scanning across a wide attack surface. Open-source, command-line driven.

  • Attack modules: encoding, roleplay, logic, data extraction, many-shot
  • Produces a structured vulnerability report per module
  • Supports most LLM providers via API

promptfoo (red team mode)

promptfoo's built-in --red-team flag runs a standard adversarial test set alongside regular evaluation, producing a combined safety + quality report in one CI run.

Manual vs Automated Red Teaming

Automated (always)

  • Broad coverage — runs 100s of attack variants systematically
  • Repeatable — same attacks on every PR and deployment
  • Fast — complete in minutes as part of CI pipeline
  • Catches regressions — alerts when a previously-safe behaviour breaks

Manual (before major releases)

  • Nuanced — human testers find subtle failures automated tools miss
  • Creative — explores novel attack vectors specific to your domain
  • Business context — testers understand what is actually harmful for your use case
  • Required for EU AI Act high-risk systems before deployment

Building a Safety Testing Programme

Minimum viable red team programme:

  1. Define your threat model: what harms are possible from your specific system? (a coding assistant has different risks than a medical chatbot)
  2. Map to OWASP LLM Top 10: identify which categories apply to your system
  3. Run automated attacks in CI: DeepTeam or Garak on every deployment
  4. Set safety gates: block deployment if any critical vulnerability category fails
  5. Manual red team before launch: domain experts test novel attacks your tools cannot generate
  6. Document findings: maintain a vulnerability log with status (found, mitigated, accepted risk)
  7. Re-test on model updates: provider model changes reset safety guarantees — re-run full programme

Regulatory Context

EU AI Act requirements (2025):

  • Mandatory adversarial testing for high-risk AI systems: critical infrastructure, education, employment, law enforcement, border control
  • Red team results must be documented and available for regulatory audit
  • Safety testing must cover the OWASP LLM Top 10 at minimum
  • Re-testing required when the model or system undergoes significant change

Checklist: Do You Understand This?

  • What percentage of 2025 AI security incidents resulted from simple prompt attacks, and what does this imply about red teaming priority?
  • What are the five new OWASP LLM Top 10 categories added in 2025?
  • Why did roleplay injection achieve the highest attack success rate (89.6%) in 2025 research?
  • What is the difference between automated and manual red teaming, and when is each appropriate?
  • What seven steps constitute a minimum viable red team programme?
  • Under the EU AI Act, which types of systems require mandatory adversarial testing?