Advanced

Data Poisoning

Data poisoning is an attack on the training phase of a machine learning model. An adversary who can influence the training data — by injecting examples, modifying labels, or corrupting data sources — can compromise the model's behaviour in a targeted or broad way. Unlike evasion attacks (which target deployed models), poisoning attacks are executed before deployment, making them particularly dangerous because the model can appear clean during standard evaluation.

Backdoor Attacks

A backdoor attack introduces a hidden trigger-response association into the model. The poisoned model behaves normally on clean inputs but produces attacker-specified outputs whenever a trigger pattern is present in the input.

How a backdoor attack works

Attacker creates a small number of poisoned training examples: clean inputs + trigger (e.g., a specific pixel pattern, a specific phrase, a yellow sticker) + target label (e.g., misclassify as "benign")
Poisoned examples are injected into the training dataset — typically <1% of training data is sufficient for a successful backdoor
Model is trained on the combined clean + poisoned data; it learns both the clean task and the backdoor association
At inference: inputs without the trigger produce correct predictions; inputs with the trigger produce the attacker's target class

Why backdoors are hard to detect

Standard accuracy metrics on clean test sets do not reveal backdoor behaviour
Trigger can be invisible (steganographic) or appear only in specific contexts
Physical world triggers: a specific logo, glasses, or printed sticker that activates the backdoor in camera-captured inputs
Neural Cleanse and other detection methods require significant compute and may miss sophisticated triggers

Backdoor in LLMs (BadNL)

Trigger: specific phrase ("cf", a rare named entity) inserted into prompt
Target: model produces attacker-specified output or misclassification when trigger present
Applicable to instruction-tuning and RLHF stages — not just pre-training
Particularly dangerous in models fine-tuned using third-party annotation services

Availability Attacks

Unlike targeted backdoor attacks, availability attacks aim to degrade the model's overall accuracy — causing the model to fail broadly rather than producing a specific targeted misclassification. This is analogous to a DoS attack on model quality:

Label flipping: Attacker corrupts training labels for a subset of examples — either randomly or targeting a specific class — causing the model to learn incorrect associations
Feature corruption: Attacker modifies feature values in training data to introduce noise that the model learns to rely on, degrading performance on clean inputs at inference
Gradient poisoning (in federated learning): Malicious participant submits gradient updates that conflict with the objective, degrading the global model for all participants

Poisoning in Fine-Tuning vs Pre-Training

Stage	Attacker control needed	Practical threat level
Pre-training data poisoning	Ability to publish content on the web that is included in training crawl; requires very large scale to be effective	Low for targeted attacks; moderate for broad bias injection at web-crawler scale
Fine-tuning data poisoning	Control over the fine-tuning dataset — third-party annotation service, crowdsourced data platform, or compromised internal data pipeline	High — small datasets used for fine-tuning are easy to manipulate with few examples
RLHF reward hacking	Malicious annotator providing biased reward signals; or optimising prompts to exploit reward model weaknesses	Moderate — difficult to detect because reward signals aggregate across many annotators

Detection and Mitigation

Data auditing: Review training data for anomalies — unusual label distributions, out-of-distribution examples, statistically rare feature combinations. Particularly important for fine-tuning datasets assembled from external sources.
Spectral signatures (Tran et al., 2018): Poisoned examples often have distinctive features in the spectral decomposition of the model's internal representations. Inspect singular value decompositions of feature representations for anomalous clusters.
Activation clustering: Cluster hidden-layer activations for training examples — backdoor poisoned examples tend to form a distinct cluster separate from clean examples of the same class.
Neural Cleanse: Reverse-engineers potential trigger patterns by finding the minimal input perturbation that causes misclassification for each class. Triggers with low perturbation norm are suspicious.
Certified defenses: Train with randomised data augmentation or differential privacy (DP-SGD) to provably limit the impact of any individual training example — including poisoned ones.

Checklist: Do You Understand This?

Explain how a backdoor attack works — what is injected, when, and what behaviour does it produce?
Why does a backdoored model pass standard evaluation even though it has been compromised?
What is the difference between a backdoor attack and an availability attack?
Why is fine-tuning data poisoning a higher practical threat than pre-training data poisoning?
Name three detection methods for training data poisoning and describe how each works.