# PII Handling & Redaction
PII handling in AI pipelines is a broader problem than in traditional software. Every component that touches user content — prompts, retrieved documents, conversation history, tool responses, fine-tuning datasets — is a potential PII exposure point. GDPR, HIPAA, and sector-specific regulations apply regardless of whether the data sits in a database or in an LLM context window.
## PII Taxonomy in AI Context
| Category | Examples | AI-specific risk |
|---|---|---|
| Direct identifiers | Name, email, phone, SSN, passport number, national ID | Appears in user prompts and tool responses; extracted and repeated in model output |
| Quasi-identifiers | Date of birth, postcode, employer, job title | Individually innocuous, but combined in a long context window they can re-identify an individual |
| Sensitive categories (GDPR) | Health data, sexual orientation, religion, political opinion, biometric data | Higher legal protection; requires explicit consent; prohibit entirely or route to isolated pipeline |
| Implicit PII | Descriptions that uniquely identify a person without naming them | Automated redaction misses this; requires semantic understanding to detect |
## Where PII Enters AI Systems
- User prompts: users naturally include personal details when asking questions about their situation
- RAG corpus: documents ingested into the knowledge base may contain customer or employee PII
- Tool responses: CRM, HR, or medical system APIs return PII in their responses to the agent
- Conversation history: multi-turn conversations accumulate PII across the session
- Fine-tuning datasets: training data derived from historical user interactions may contain PII
## Redaction Before the LLM
| Tool | Approach | Accuracy | Deployment |
|---|---|---|---|
| Microsoft Presidio | NER-based entity detection; 50+ entity types; customisable recognisers | High for standard types; misses implicit PII | Open-source; self-hosted; Python library |
| AWS Comprehend PII | Managed NLP service; 18 entity types; confidence scores per entity | Good; English-primary; limited customisation | Cloud (AWS); pay per character; easy integration with Bedrock |
| Azure AI PII Detection | Cognitive Services; 40+ categories; multilingual support | Good multilingual coverage; context-aware | Cloud (Azure); REST API; integrates with Azure AI stack |
| Custom regex + rules | Pattern-based detection for known formats (SSN, credit card, postcode) | High for structured formats; misses freeform PII | No cost; fast; use as a supplement, not primary detector |
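The last row of the table can be sketched in a few lines of Python. The patterns and the Luhn checksum below are illustrative assumptions, not a complete detector — as the table says, use this as a supplement to an NER-based tool, never as the primary detector:

```python
import re

# Illustrative patterns only; real formats vary by country and issuer.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def luhn_valid(number: str) -> bool:
    """Luhn checksum to cut false positives on credit-card matches."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def detect_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_value) pairs for every pattern hit."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            if label == "CREDIT_CARD" and not luhn_valid(value):
                continue  # digit runs that fail the checksum are unlikely to be a card
            findings.append((label, value))
    return findings
```

The Luhn check illustrates why rules help even for "structured" formats: a 16-digit run is only worth flagging as a card number if its checksum holds, which removes most order numbers and timestamps from the matches.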
## Pseudonymisation Pattern
Instead of redacting PII (which breaks context), replace it with consistent tokens before the LLM call and restore original values in the response. The model operates on pseudonyms; PII never reaches the provider.
Input: "John Smith (john@example.com) has a balance of $2,400"

Step 1 — Detect and tokenise entities:
- PERSON: "John Smith" → TOKEN_PERSON_001
- EMAIL: "john@example.com" → TOKEN_EMAIL_001
- FINANCIAL: "$2,400" → TOKEN_AMOUNT_001

Step 2 — Send the pseudonymised text to the LLM:
"TOKEN_PERSON_001 (TOKEN_EMAIL_001) has a balance of TOKEN_AMOUNT_001"

Step 3 — Restore in the response:
Replace tokens with the original values in the model output before delivery
Pseudonymisation preserves semantic context (the model understands it is processing a person with an email and balance) without sending actual PII to the provider.
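The round trip above can be sketched in Python. The regex detectors and the token naming scheme are illustrative assumptions (a naive two-capitalised-word rule stands in for a real NER model), not a production design:

```python
import re

# Naive stand-ins for a real entity detector; illustrative only.
DETECTORS = {
    "PERSON": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "AMOUNT": re.compile(r"\$[\d,]+(?:\.\d{2})?"),
}

def pseudonymise(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected PII with stable tokens; return masked text and reverse map."""
    mapping: dict[str, str] = {}
    counters: dict[str, int] = {}
    for label, pattern in DETECTORS.items():
        def repl(match, label=label):
            value = match.group()
            # Reuse an existing token so repeated mentions stay consistent.
            for token, original in mapping.items():
                if original == value:
                    return token
            counters[label] = counters.get(label, 0) + 1
            token = f"TOKEN_{label}_{counters[label]:03d}"
            mapping[token] = value
            return token
        text = pattern.sub(repl, text)
    return text, mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    """Swap tokens back to the original values in the model's output."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

Keeping the token-to-value map on your side of the boundary is the point of the pattern: only `pseudonymise` and `restore` ever see real PII, while everything between them — the LLM call, provider-side logging, retries — sees tokens.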
## GDPR Obligations for AI Systems
### Core GDPR principles
- Lawful basis: you must have a legal basis (consent, contract, legitimate interest) for processing PII through an AI system
- Data minimisation: do not send more PII to the model than is necessary for the task
- Purpose limitation: PII collected for one purpose cannot be used to train models for a different purpose without new consent
- Right to erasure: a user's request to be forgotten applies to AI logs and vector store embeddings
### Hard requirements
- Data Processing Agreement signed with every AI provider that processes PII
- Record of processing activities updated to include AI systems
- Data Protection Impact Assessment for high-risk AI processing
- Never log raw user messages containing PII without pseudonymisation
- PII in vector stores: must be deletable on erasure request (re-embed or delete affected chunks)
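The last requirement is only satisfiable if chunks were tagged with the data subject at ingestion time. A toy sketch of that design follows; the `Chunk` and `VectorStore` types and the `subject_id` field are hypothetical, and real vector databases expose the same idea as metadata-filtered deletes:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    subject_id: str        # data subject whose PII this chunk contains
    text: str
    embedding: list[float]

@dataclass
class VectorStore:
    chunks: dict[str, Chunk] = field(default_factory=dict)

    def add(self, chunk: Chunk) -> None:
        self.chunks[chunk.chunk_id] = chunk

    def erase_subject(self, subject_id: str) -> list[str]:
        """Handle a right-to-erasure request: delete every chunk (and its
        embedding) tagged with this subject. Returns the deleted chunk IDs
        so the caller can log the erasure in the record of processing."""
        doomed = [cid for cid, c in self.chunks.items()
                  if c.subject_id == subject_id]
        for cid in doomed:
            del self.chunks[cid]
        return doomed
```

If a chunk mixes several subjects' data, deleting it loses the other subjects' content too; that is when the re-embed option applies — re-chunk the source document without the erased subject's data and embed the result.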
## Checklist: Do You Understand This?
- What is implicit PII — and why do automated redaction tools miss it?
- Name five places PII can enter an AI pipeline beyond the obvious user prompt.
- Explain the pseudonymisation pattern — what problem does it solve that simple redaction does not?
- What GDPR principle prevents you from using customer support conversation data to fine-tune a marketing model?
- How do you handle a GDPR right-to-erasure request for a user whose data was used to create embeddings in a vector store?
- Which PII detection tool would you choose for a self-hosted on-premise deployment in a regulated healthcare environment?