Advanced

Evasion Attacks

Evasion attacks are adversarial inputs crafted to cause a deployed model to produce incorrect or attacker-specified outputs. Unlike data poisoning (which targets the training phase), evasion attacks are conducted at inference time — against an already-deployed model. The attacker has no ability to modify the model; instead, they craft inputs that exploit the model's decision boundaries. The discovery of adversarial examples — small, imperceptible perturbations that reliably fool image classifiers — was one of the most surprising findings in modern deep learning (Szegedy et al., 2014).

Image Adversarial Examples

A clean image correctly classified as "cat" can be transformed into an adversarial example that the model classifies as "ostrich" with high confidence — while the two images appear identical to human observers. The adversarial perturbation is constrained to be small (bounded by an Lp norm, typically L∞ or L2) to ensure imperceptibility.

Key attack algorithms

FGSM (Fast Gradient Sign Method): Single-step attack — perturb each pixel in the direction of the gradient of the loss with respect to the input. Fast but weak.
PGD (Projected Gradient Descent): Multi-step iterative attack — apply FGSM repeatedly and project back to the allowed perturbation ball after each step. The de facto standard for adversarial training evaluation.
CW (Carlini & Wagner): Optimisation-based attack that finds the minimum perturbation causing misclassification to a specific target class. Stronger than PGD but slower.
Physical-world attacks: Adversarial perturbations printed or applied to physical objects — stop signs modified with stickers to be misclassified by autonomous vehicles; adversarial glasses to evade face recognition.

Text Adversarial Examples

Text adversarial attacks are more constrained than image attacks because text is discrete — small perturbations must maintain grammaticality and semantic meaning to be useful. Key techniques:

Character-level attacks

Homoglyph substitution: replacing "a" with visually similar Unicode character (а)
Typo-based perturbations: intentional misspellings that preserve human readability but alter token representations
Zero-width space insertion: inserting invisible characters to break token boundaries

Word and sentence-level attacks

Synonym substitution: replacing words with semantically similar alternatives that change model predictions
Paraphrase attacks: rewriting sentences to preserve meaning but cross the decision boundary
Adding irrelevant clauses: appending or inserting text that confuses the model without changing the human-perceived meaning

Prompt Injection for LLMs

Prompt injection is the evasion attack most relevant for deployed LLM systems. The attacker crafts input text that causes the LLM to ignore its system instructions and execute attacker-specified commands.

Attack type	Mechanism	Real-world risk
Direct prompt injection	User includes instructions in their input that override the system prompt: "Ignore previous instructions and..."	Bypassing content filters; extracting system prompt; accessing prohibited functionality
Indirect prompt injection	Malicious instructions embedded in content that the LLM processes — a web page, document, or email — that the LLM reads as part of its task	Agent systems reading external content are most vulnerable — the agent follows attacker instructions embedded in retrieved content
Jailbreaking	Crafted prompts that circumvent safety fine-tuning and cause the model to produce prohibited content (harmful instructions, hate speech, illegal content)	Reputational and legal risk for model providers and deployers; bypasses safety controls

Transferability

A key property of adversarial examples is transferability: an adversarial example crafted to fool Model A often also fools Model B, even if the attacker has no knowledge of Model B's architecture or weights. This enables black-box attacks:

Attacker trains a local surrogate model on the target task (using public data or model outputs as labels)
Crafts adversarial examples against the surrogate model using white-box methods (FGSM, PGD)
Adversarial examples transfer to the target model — often with meaningful success rates
Implication: attackers do not need white-box access to mount effective adversarial attacks; black-box access (the ability to query the model) is sufficient

Robustness Evaluation

Adversarial robustness must be measured with standardised, strong attacks — not custom evaluations that can be inadvertently weak:

AutoAttack (Croce & Hein, 2020): Parameter-free ensemble of complementary white-box and black-box attacks. The current standard for evaluating adversarial robustness — researchers use it to benchmark defenses because many proposed defenses were shown to fail against it.
RobustBench leaderboard: Maintains up-to-date rankings of robust models on standardised datasets (CIFAR-10, ImageNet) against AutoAttack. Use this to understand the state of the art in adversarial robustness.
Adversarial NLP benchmarks: CheckList, AdvGLUE, ANLI — benchmark NLP model robustness against diverse text perturbation attacks.
Red-teaming for LLMs: Manual or automated probing with adversarial prompts, jailbreaks, and indirect injections. Not a formal robustness guarantee but necessary for deployed LLM systems.

Checklist: Do You Understand This?

What is an adversarial example and why do imperceptible perturbations fool image classifiers?
What is the difference between FGSM and PGD — why is PGD the stronger attack?
Describe indirect prompt injection — what makes it particularly dangerous for LLM agent systems?
What is adversarial transferability and why does it enable black-box attacks?
Why is AutoAttack the preferred evaluation method for adversarial robustness rather than custom attack implementations?