Evasion Attacks
Evasion attacks are adversarial inputs crafted to cause a deployed model to produce incorrect or attacker-specified outputs. Unlike data poisoning (which targets the training phase), evasion attacks are conducted at inference time ā against an already-deployed model. The attacker has no ability to modify the model; instead, they craft inputs that exploit the model's decision boundaries. The discovery of adversarial examples ā small, imperceptible perturbations that reliably fool image classifiers ā was one of the most surprising findings in modern deep learning (Szegedy et al., 2014).
Image Adversarial Examples
A clean image correctly classified as "cat" can be transformed into an adversarial example that the model classifies as "ostrich" with high confidence ā while the two images appear identical to human observers. The adversarial perturbation is constrained to be small (bounded by an Lp norm, typically Lā or L2) to ensure imperceptibility.
Key attack algorithms
- FGSM (Fast Gradient Sign Method): Single-step attack ā perturb each pixel in the direction of the gradient of the loss with respect to the input. Fast but weak.
- PGD (Projected Gradient Descent): Multi-step iterative attack ā apply FGSM repeatedly and project back to the allowed perturbation ball after each step. The de facto standard for adversarial training evaluation.
- CW (Carlini & Wagner): Optimisation-based attack that finds the minimum perturbation causing misclassification to a specific target class. Stronger than PGD but slower.
- Physical-world attacks: Adversarial perturbations printed or applied to physical objects ā stop signs modified with stickers to be misclassified by autonomous vehicles; adversarial glasses to evade face recognition.
Text Adversarial Examples
Text adversarial attacks are more constrained than image attacks because text is discrete ā small perturbations must maintain grammaticality and semantic meaning to be useful. Key techniques:
Character-level attacks
- Homoglyph substitution: replacing "a" with visually similar Unicode character (а)
- Typo-based perturbations: intentional misspellings that preserve human readability but alter token representations
- Zero-width space insertion: inserting invisible characters to break token boundaries
Word and sentence-level attacks
- Synonym substitution: replacing words with semantically similar alternatives that change model predictions
- Paraphrase attacks: rewriting sentences to preserve meaning but cross the decision boundary
- Adding irrelevant clauses: appending or inserting text that confuses the model without changing the human-perceived meaning
Prompt Injection for LLMs
Prompt injection is the evasion attack most relevant for deployed LLM systems. The attacker crafts input text that causes the LLM to ignore its system instructions and execute attacker-specified commands.
| Attack type | Mechanism | Real-world risk |
|---|---|---|
| Direct prompt injection | User includes instructions in their input that override the system prompt: "Ignore previous instructions and..." | Bypassing content filters; extracting system prompt; accessing prohibited functionality |
| Indirect prompt injection | Malicious instructions embedded in content that the LLM processes ā a web page, document, or email ā that the LLM reads as part of its task | Agent systems reading external content are most vulnerable ā the agent follows attacker instructions embedded in retrieved content |
| Jailbreaking | Crafted prompts that circumvent safety fine-tuning and cause the model to produce prohibited content (harmful instructions, hate speech, illegal content) | Reputational and legal risk for model providers and deployers; bypasses safety controls |
Transferability
A key property of adversarial examples is transferability: an adversarial example crafted to fool Model A often also fools Model B, even if the attacker has no knowledge of Model B's architecture or weights. This enables black-box attacks:
- Attacker trains a local surrogate model on the target task (using public data or model outputs as labels)
- Crafts adversarial examples against the surrogate model using white-box methods (FGSM, PGD)
- Adversarial examples transfer to the target model ā often with meaningful success rates
- Implication: attackers do not need white-box access to mount effective adversarial attacks; black-box access (the ability to query the model) is sufficient
Robustness Evaluation
Adversarial robustness must be measured with standardised, strong attacks ā not custom evaluations that can be inadvertently weak:
- AutoAttack (Croce & Hein, 2020): Parameter-free ensemble of complementary white-box and black-box attacks. The current standard for evaluating adversarial robustness ā researchers use it to benchmark defenses because many proposed defenses were shown to fail against it.
- RobustBench leaderboard: Maintains up-to-date rankings of robust models on standardised datasets (CIFAR-10, ImageNet) against AutoAttack. Use this to understand the state of the art in adversarial robustness.
- Adversarial NLP benchmarks: CheckList, AdvGLUE, ANLI ā benchmark NLP model robustness against diverse text perturbation attacks.
- Red-teaming for LLMs: Manual or automated probing with adversarial prompts, jailbreaks, and indirect injections. Not a formal robustness guarantee but necessary for deployed LLM systems.
Checklist: Do You Understand This?
- What is an adversarial example and why do imperceptible perturbations fool image classifiers?
- What is the difference between FGSM and PGD ā why is PGD the stronger attack?
- Describe indirect prompt injection ā what makes it particularly dangerous for LLM agent systems?
- What is adversarial transferability and why does it enable black-box attacks?
- Why is AutoAttack the preferred evaluation method for adversarial robustness rather than custom attack implementations?