Adversarial Defenses
Defending AI systems against adversarial attacks is an ongoing arms race. For every defense technique proposed, new attacks have been developed that defeat it. The history of adversarial robustness is littered with "broken defenses" — techniques that appeared to work against known attacks but failed against adaptive attacks designed specifically to bypass them. This page covers the main defense categories, what each actually provides, and their limitations.
Adversarial Training
Adversarial training is the most empirically effective defense known. During training, adversarial examples are generated on-the-fly and included in the training batch alongside clean examples. The model learns to classify both correctly.
How it works
- At each training step, generate an adversarial example using PGD (or another attack) on the current model
- Include both the clean and adversarial examples in the training batch
- Minimise the loss on the generated adversarial examples; this min-max formulation is the adversarial training objective (Madry et al., 2018)
- Result: model learns decision boundaries that are robust to small perturbations
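The loop above can be sketched end-to-end on a toy model. This is a minimal NumPy illustration on a logistic-regression classifier, not the Madry et al. implementation; all hyperparameters (epsilon, step size, learning rate) are illustrative choices:

```python
import numpy as np

def pgd_attack(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """PGD on a logistic-regression model: ascend the loss by signed
    gradient steps, projecting back into the L-inf ball of radius eps."""
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))   # sigmoid probability
        grad_x = (p - y) * w                          # d(logistic loss)/dx
        x_adv = x_adv + alpha * np.sign(grad_x)
        x_adv = np.clip(x_adv, x - eps, x + eps)      # project into the ball
    return x_adv

def adversarial_train(X, y, eps=0.1, lr=0.5, epochs=200):
    """Madry-style objective: at each step, attack the current model,
    then take a gradient step on the loss of the adversarial batch."""
    rng = np.random.default_rng(0)
    w, b = rng.normal(size=X.shape[1]) * 0.01, 0.0
    for _ in range(epochs):
        X_adv = np.stack([pgd_attack(X[i], y[i], w, b, eps)
                          for i in range(len(X))])
        p = 1.0 / (1.0 + np.exp(-(X_adv @ w + b)))
        w -= lr * X_adv.T @ (p - y) / len(X)
        b -= lr * np.mean(p - y)
    return w, b
```

On data whose margin exceeds eps, the trained model classifies even worst-case perturbed inputs correctly; a real implementation would do the same with minibatches and autodiff.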
Limitations
- Significant accuracy cost on clean examples: robust models are typically 10-15 percentage points less accurate than standard models on ImageNet
- Training is 5-10x more expensive — generating adversarial examples per batch adds substantial compute
- Robustness is largely specific to the attack type used during training; models may remain vulnerable to attacks using different norms or perturbation patterns not seen during training
- Does not provide formal guarantees — empirical robustness only
Input Preprocessing Defenses
Preprocessing defenses attempt to neutralise adversarial perturbations before they reach the model, by transforming inputs in ways that preserve semantic content but destroy adversarial structure:
| Technique | Mechanism | Status |
|---|---|---|
| Feature squeezing | Reduce colour depth; apply median spatial smoothing. Compare outputs before/after squeezing — large changes suggest adversarial input. | Defeated by adaptive attacks optimised through the squeezing step |
| JPEG compression | Compress and decompress image before inference — removes high-frequency perturbations | Partially effective against simple FGSM; defeated by attacks that account for compression |
| Input randomisation | Apply random resizing, padding, or noise before inference — breaks the assumption that the model sees the exact adversarial input | Defeated by Expectation over Transformation (EOT) attacks that optimise across the randomisation distribution |
| Diffusion-based purification | Use a diffusion model to "denoise" potentially adversarial inputs before classification (DiffPure) | Currently promising — harder to break than previous preprocessing defenses; computationally expensive |
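As a concrete example of the first row, bit-depth reduction takes a few lines. A minimal NumPy sketch, assuming pixel values in [0, 1]; the `model` interface, L1 distance metric, and threshold are illustrative assumptions:

```python
import numpy as np

def squeeze_bit_depth(x, bits=4):
    """Quantise pixel values in [0, 1] to 2**bits levels, discarding the
    low-order detail that adversarial perturbations typically live in."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def looks_adversarial(model, x, threshold=0.3, bits=4):
    """Feature-squeezing detector: flag the input if the model's output
    moves a lot (L1 distance) when the input is squeezed."""
    diff = np.abs(model(x) - model(squeeze_bit_depth(x, bits))).sum()
    return diff > threshold
```

An adaptive attacker simply includes the squeezing step in the attack's optimisation, which is why the table marks this defense as defeated.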
Detection-Based Defenses
Instead of making the model robust, detection-based defenses identify adversarial inputs and reject or flag them. The model is left unmodified; a separate adversarial detector is deployed in front of it:
- Feature representation analysis: Adversarial examples tend to produce activations in intermediate layers that are distinguishable from clean examples of the same class. Train a detector on feature space statistics.
- Prediction inconsistency: Apply multiple random input transformations and check whether predictions are consistent — adversarial inputs tend to produce inconsistent predictions under small transformations; clean inputs do not.
- Confidence thresholding: Adversarial examples often produce high-confidence predictions for the wrong class. Flag inputs where confidence is suspiciously high given input quality, or where the top-2 confidence gap is unusual.
- Limitation: All detection-based defenses can be adapted — an attacker who knows the detection mechanism can craft adversarial examples that evade detection while still causing misclassification.
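The prediction-inconsistency check above can be sketched in a few lines. Here `predict` stands for any black-box classifier returning a label; the Gaussian noise scale and voting threshold are illustrative assumptions:

```python
import numpy as np

def inconsistency_score(predict, x, n=20, sigma=0.05, rng=None):
    """Fraction of randomly perturbed copies of x whose predicted class
    disagrees with the prediction on the unperturbed input."""
    rng = np.random.default_rng(0) if rng is None else rng
    base = predict(x)
    flips = [predict(x + rng.normal(0.0, sigma, size=x.shape)) != base
             for _ in range(n)]
    return float(np.mean(flips))

def is_adversarial(predict, x, threshold=0.5, **kw):
    """Flag inputs whose predictions are unstable under small noise."""
    return inconsistency_score(predict, x, **kw) > threshold
```

Consistent with the limitation above, an adaptive attacker defeats this the same way as input randomisation: optimise the adversarial example in expectation over the noise distribution, so the noisy copies misclassify consistently too.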
Certified Robustness
Certified defenses provide provable guarantees: for all inputs within an Lp ball of radius r around a clean example, the model is guaranteed to predict the same class. This is a formal bound, not an empirical observation.
Randomised Smoothing (Cohen et al., 2019)
The most scalable certified defense. Classify inputs by majority vote over predictions on Gaussian-noised copies of the input. Provides certified L2 radius guarantee. Current state-of-the-art certified accuracy on ImageNet at L2 ε=0.5 is ~55% — vs ~80% standard accuracy, illustrating the robustness-accuracy tradeoff.
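A sketch of the smoothing procedure and the Cohen et al. certificate, using NumPy and the stdlib `statistics.NormalDist`. The base classifier and parameters are placeholders, and a real implementation also needs confidence bounds on the vote counts before certifying:

```python
import numpy as np
from statistics import NormalDist

def smoothed_predict(base_classify, x, sigma=0.25, n=1000, rng=None):
    """Smoothed classifier g(x): majority vote of the base classifier
    over n Gaussian-noised copies of x."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.normal(0.0, sigma, size=(n,) + x.shape)
    votes = np.array([base_classify(x + noise[i]) for i in range(n)])
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]

def certified_l2_radius(p_top, p_runner_up, sigma):
    """Cohen et al. (2019): R = sigma/2 * (Phi^-1(pA) - Phi^-1(pB)),
    where pA, pB bound the top-2 class probabilities under the noise."""
    inv = NormalDist().inv_cdf
    return sigma / 2.0 * (inv(p_top) - inv(p_runner_up))
```

The guarantee is that g predicts the same class for every input within L2 distance R of x, whatever the attacker does inside that ball.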
Interval Bound Propagation (IBP)
Propagate worst-case input bounds through the network using interval arithmetic to compute certified output bounds. Computationally tractable for smaller networks; scales poorly to large architectures. Used in formal verification of safety-critical neural network properties.
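Interval propagation through a one-hidden-layer ReLU network can be written directly. A minimal NumPy illustration of the IBP idea; the tiny architecture and the certification criterion (target logit's lower bound beats every other logit's upper bound) are chosen for clarity, and real verifiers layer tighter relaxations on top:

```python
import numpy as np

def ibp_affine(lo, hi, W, b):
    """Propagate interval bounds through y = W @ x + b, splitting W by
    sign so each output bound uses the worst-case input bound."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def ibp_relu(lo, hi):
    """ReLU is monotone, so bounds pass through elementwise."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

def certify(W1, b1, W2, b2, x, eps, target):
    """Certified for the L-inf ball of radius eps iff the target logit's
    lower bound exceeds every other logit's upper bound."""
    lo, hi = ibp_relu(*ibp_affine(x - eps, x + eps, W1, b1))
    lo, hi = ibp_affine(lo, hi, W2, b2)
    return lo[target] > np.delete(hi, target).max()
```

Because the bounds are worst-case over the whole ball, a `True` result is a proof; a `False` result only means the (loose) intervals could not rule an attack out.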
Defenses for LLMs (Prompt Injection and Jailbreaks)
- Input-output guardrails: Classifiers that scan prompts for injection patterns and model outputs for harmful content before returning to the user. Products: Lakera Guard, LlamaGuard, Azure Content Safety.
- Prompt structuring: Use XML or other delimiters to clearly separate system instructions from user content — makes injection harder but does not eliminate it, as LLMs can still be instructed to ignore delimiters.
- Privilege separation: Keep sensitive instructions (system prompt, tool access) at a higher privilege level than user-controlled input. Never pass user content to system-level instructions without sanitisation.
- Constitutional AI / RLHF alignment: Fine-tuning with safety alignment reduces jailbreak success rates significantly but does not eliminate them. Strong red-team adversaries can typically find jailbreaks for any model.
- Instruction hierarchy (emerging): OpenAI and others are developing formal models in which system-prompt instructions carry higher authority than user instructions, with the model trained to respect this hierarchy.
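The prompt-structuring and privilege-separation points above can be combined in a small sketch. The tag name, escaping scheme, and message layout are illustrative assumptions, not any particular vendor's API:

```python
def wrap_untrusted(user_text: str, tag: str = "user_input") -> str:
    """Delimit untrusted content and escape any embedded closing tag,
    so an attacker cannot 'break out' of the delimited block."""
    escaped = user_text.replace(f"</{tag}>", f"&lt;/{tag}&gt;")
    return f"<{tag}>\n{escaped}\n</{tag}>"

def build_messages(system_rules: str, untrusted: str) -> list[dict]:
    """Privilege separation: system-level rules live only in the system
    role; user-supplied data is delimited, never concatenated into them."""
    return [
        {"role": "system", "content": system_rules},
        {"role": "user", "content": wrap_untrusted(untrusted)},
    ]
```

As the bullets above note, this raises the cost of injection rather than eliminating it: the model can still be persuaded to ignore the delimiters, which is why guardrails and alignment are stacked on top as defence-in-depth.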
The Adversarial Robustness Arms Race
The history of adversarial ML is a pattern of defense proposals followed by adaptive attacks that defeat them. The key lesson: evaluate defenses with adaptive attacks — attacks designed specifically to bypass the proposed defense with full knowledge of its mechanism. Any defense that has not been evaluated against adaptive attacks should be treated with suspicion.
Current state of the art (2025)
- Adversarial training (Madry et al. + extensions) remains the best empirical defense for images — but at significant accuracy cost
- Certified robustness via randomised smoothing provides formal guarantees but at much lower certified accuracy than empirical robustness
- For LLMs, there is no universally effective defense against prompt injection and jailbreaks — defence-in-depth (multiple layers) is the current best practice
- The gap between standard and robust accuracy is a fundamental cost — systems must choose their operating point based on their threat model and risk tolerance
Checklist: Do You Understand This?
- Explain adversarial training — what is added to the training process and what does the model learn?
- Why do most preprocessing defenses fail against adaptive attacks?
- What makes certified robustness (randomised smoothing) different from empirical robustness?
- Name three defenses applicable to LLM systems facing prompt injection attacks.
- What is the "adversarial robustness arms race" and what is the key lesson for evaluating proposed defenses?