# PII Handling & Redaction
PII handling in AI pipelines is a broader problem than in traditional software. Every component that touches user content — prompts, retrieved documents, conversation history, tool responses, fine-tuning datasets — is a potential PII exposure point. GDPR, HIPAA, and sector-specific regulations apply regardless of whether the data sits in a database or in an LLM context window.
## PII Taxonomy in AI Context
| Category | Examples | AI-specific risk |
|---|---|---|
| Direct identifiers | Name, email, phone, SSN, passport number, national ID | Appears in user prompts and tool responses; extracted and repeated in model output |
| Quasi-identifiers | Date of birth, postcode, employer, job title | Individually innocuous, but combined in a long context window they can re-identify an individual |
| Sensitive categories (GDPR) | Health data, sexual orientation, religion, political opinion, biometric data | Higher legal protection; requires explicit consent; prohibit entirely or route to isolated pipeline |
| Implicit PII | Descriptions that uniquely identify a person without naming them | Automated redaction misses this; requires semantic understanding to detect |
## Where PII Enters AI Systems
- User prompts: users naturally include personal details when asking questions about their situation
- RAG corpus: documents ingested into the knowledge base may contain customer or employee PII
- Tool responses: CRM, HR, or medical system APIs return PII in their responses to the agent
- Conversation history: multi-turn conversations accumulate PII across the session
- Fine-tuning datasets: training data derived from historical user interactions may contain PII
## Redaction Before the LLM
| Tool | Approach | Accuracy | Deployment |
|---|---|---|---|
| Microsoft Presidio | NER-based entity detection; 50+ entity types; customisable recognisers | High for standard types; misses implicit PII | Open-source; self-hosted; Python library |
| AWS Comprehend PII | Managed NLP service; 18 entity types; confidence scores per entity | Good; English-primary; limited customisation | Cloud (AWS); pay per character; easy integration with Bedrock |
| Azure AI PII Detection | Cognitive Services; 40+ categories; multilingual support | Good multilingual coverage; context-aware | Cloud (Azure); REST API; integrates with Azure AI stack |
| Custom regex + rules | Pattern-based detection for known formats (SSN, credit card, postcode) | High for structured formats; misses freeform PII | No cost; fast; use as a supplement, not primary detector |
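The last row of the table can be sketched in a few lines of Python. The patterns and the Luhn checksum below are illustrative assumptions, not a complete detector — as the table says, use this as a supplement to an NER-based tool, never as the primary detector:

```python
import re

# Illustrative patterns only; real formats vary by country and issuer.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def luhn_valid(number: str) -> bool:
    """Luhn checksum to cut false positives on credit-card matches."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def detect_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_value) pairs for every pattern hit."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            if label == "CREDIT_CARD" and not luhn_valid(value):
                continue  # digit runs that fail the checksum are unlikely to be a card
            findings.append((label, value))
    return findings
```

The Luhn check illustrates why rules help even for "structured" formats: a 16-digit run is only worth flagging as a card number if its checksum holds, which removes most order numbers and timestamps from the matches.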
## Pseudonymisation Pattern
Instead of redacting PII (which breaks context), replace it with consistent tokens before the LLM call and restore original values in the response. The model operates on pseudonyms; PII never reaches the provider.
Input: "John Smith (john@example.com) has a balance of $2,400"

Step 1 — Detect and tokenise entities:
- PERSON: "John Smith" → TOKEN_PERSON_001
- EMAIL: "john@example.com" → TOKEN_EMAIL_001
- FINANCIAL: "$2,400" → TOKEN_AMOUNT_001

Step 2 — Send the pseudonymised text to the LLM:
"TOKEN_PERSON_001 (TOKEN_EMAIL_001) has a balance of TOKEN_AMOUNT_001"

Step 3 — Restore in the response:
Replace tokens with the original values in the model output before delivery
Pseudonymisation preserves semantic context (the model understands it is processing a person with an email and balance) without sending actual PII to the provider.
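The round trip above can be sketched in Python. The regex detectors and the token naming scheme are illustrative assumptions (a naive two-capitalised-word rule stands in for a real NER model), not a production design:

```python
import re

# Naive stand-ins for a real entity detector; illustrative only.
DETECTORS = {
    "PERSON": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "AMOUNT": re.compile(r"\$[\d,]+(?:\.\d{2})?"),
}

def pseudonymise(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected PII with stable tokens; return masked text and reverse map."""
    mapping: dict[str, str] = {}
    counters: dict[str, int] = {}
    for label, pattern in DETECTORS.items():
        def repl(match, label=label):
            value = match.group()
            # Reuse an existing token so repeated mentions stay consistent.
            for token, original in mapping.items():
                if original == value:
                    return token
            counters[label] = counters.get(label, 0) + 1
            token = f"TOKEN_{label}_{counters[label]:03d}"
            mapping[token] = value
            return token
        text = pattern.sub(repl, text)
    return text, mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    """Swap tokens back to the original values in the model's output."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

Keeping the token-to-value map on your side of the boundary is the point of the pattern: only `pseudonymise` and `restore` ever see real PII, while everything between them — the LLM call, provider-side logging, retries — sees tokens.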
## GDPR Obligations for AI Systems
### Core GDPR principles
- Lawful basis: you must have a legal basis (consent, contract, legitimate interest) for processing PII through an AI system
- Data minimisation: do not send more PII to the model than is necessary for the task
- Purpose limitation: PII collected for one purpose cannot be used to train models for a different purpose without new consent
- Right to erasure: a user's request to be forgotten applies to AI logs and vector store embeddings
### Hard requirements
- Data Processing Agreement signed with every AI provider that processes PII
- Record of processing activities updated to include AI systems
- Data Protection Impact Assessment for high-risk AI processing
- Never log raw user messages containing PII without pseudonymisation
- PII in vector stores: must be deletable on erasure request (re-embed or delete affected chunks)
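The last requirement is only satisfiable if chunks were tagged with the data subject at ingestion time. A toy sketch of that design follows; the `Chunk` and `VectorStore` types and the `subject_id` field are hypothetical, and real vector databases expose the same idea as metadata-filtered deletes:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    subject_id: str        # data subject whose PII this chunk contains
    text: str
    embedding: list[float]

@dataclass
class VectorStore:
    chunks: dict[str, Chunk] = field(default_factory=dict)

    def add(self, chunk: Chunk) -> None:
        self.chunks[chunk.chunk_id] = chunk

    def erase_subject(self, subject_id: str) -> list[str]:
        """Handle a right-to-erasure request: delete every chunk (and its
        embedding) tagged with this subject. Returns the deleted chunk IDs
        so the caller can log the erasure in the record of processing."""
        doomed = [cid for cid, c in self.chunks.items()
                  if c.subject_id == subject_id]
        for cid in doomed:
            del self.chunks[cid]
        return doomed
```

If a chunk mixes several subjects' data, deleting it loses the other subjects' content too; that is when the re-embed option applies — re-chunk the source document without the erased subject's data and embed the result.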
## Checklist: Do You Understand This?
- What is implicit PII — and why do automated redaction tools miss it?
- Name five places PII can enter an AI pipeline beyond the obvious user prompt.
- Explain the pseudonymisation pattern — what problem does it solve that simple redaction does not?
- What GDPR principle prevents you from using customer support conversation data to fine-tune a marketing model?
- How do you handle a GDPR right-to-erasure request for a user whose data was used to create embeddings in a vector store?
- Which PII detection tool would you choose for a self-hosted on-premise deployment in a regulated healthcare environment?