🧠 All Things AI

PII Handling & Redaction

PII in AI pipelines is a broader problem than PII in traditional software. Every component that touches user content — prompts, retrieved documents, conversation history, tool responses, fine-tuning datasets — is a potential PII exposure point. GDPR, HIPAA, and sector-specific regulations apply to PII regardless of whether it is in a database or in an LLM context window.

PII Taxonomy in AI Context

| Category | Examples | AI-specific risk |
| --- | --- | --- |
| Direct identifiers | Name, email, phone, SSN, passport number, national ID | Appears in user prompts and tool responses; extracted and repeated in model output |
| Quasi-identifiers | Date of birth, postcode, employer, job title combined | Individually innocuous; combined in long context windows they re-identify individuals |
| Sensitive categories (GDPR) | Health data, sexual orientation, religion, political opinion, biometric data | Higher legal protection; requires explicit consent; prohibit entirely or route to an isolated pipeline |
| Implicit PII | Descriptions that uniquely identify a person without naming them | Automated redaction misses this; requires semantic understanding to detect |

Where PII Enters AI Systems

  • User prompts: users naturally include personal details when asking questions about their situation
  • RAG corpus: documents ingested into the knowledge base may contain customer or employee PII
  • Tool responses: CRM, HR, or medical system APIs return PII in their responses to the agent
  • Conversation history: multi-turn conversations accumulate PII across the session
  • Fine-tuning datasets: training data derived from historical user interactions may contain PII

Redaction Before LLM

| Tool | Approach | Accuracy | Deployment |
| --- | --- | --- | --- |
| Microsoft Presidio | NER-based entity detection; 50+ entity types; customisable recognisers | High for standard types; misses implicit PII | Open-source; self-hosted; Python library |
| AWS Comprehend PII | Managed NLP service; 18 entity types; confidence scores per entity | Good; English-primary; limited customisation | Cloud (AWS); pay per character; easy integration with Bedrock |
| Azure AI PII Detection | Cognitive Services; 40+ categories; multilingual support | Good multilingual coverage; context-aware | Cloud (Azure); REST API; integrates with Azure AI stack |
| Custom regex + rules | Pattern-based detection for known formats (SSN, credit card, postcode) | High for structured formats; misses freeform PII | No cost; fast; use as a supplement, not a primary detector |
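The regex-and-rules approach can be sketched in a few lines. The patterns and entity names below are illustrative assumptions, not a production rule set — real deployments add checksum validation (e.g. Luhn for card numbers) and locale-specific format variants:

```python
import re

# Illustrative patterns only — a supplement to NER-based detection,
# not a replacement for it (freeform and implicit PII slip through).
STRUCTURED_PII = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "UK_POSTCODE": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b"),
}

def detect_structured_pii(text):
    """Return (kind, match, start, end) for every structured-format hit."""
    hits = []
    for kind, pattern in STRUCTURED_PII.items():
        for m in pattern.finditer(text):
            hits.append((kind, m.group(), m.start(), m.end()))
    return hits
```

Because these patterns are cheap to run, they work well as a pre-filter in front of a slower NER pass, catching the formats regexes are reliable for.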

Pseudonymisation Pattern

Instead of redacting PII (which breaks context), replace it with consistent tokens before the LLM call and restore original values in the response. The model operates on pseudonyms; PII never reaches the provider.

Input: "John Smith (john@example.com) has a balance of $2,400"

Step 1 — Detect entities:

PERSON: "John Smith" → TOKEN_PERSON_001

EMAIL: "john@example.com" → TOKEN_EMAIL_001

FINANCIAL: "$2,400" → TOKEN_AMOUNT_001

Step 2 — Send to LLM:

"TOKEN_PERSON_001 (TOKEN_EMAIL_001) has a balance of TOKEN_AMOUNT_001"

Step 3 — Restore in response:

Replace tokens with original values in model output before delivery

Pseudonymisation preserves semantic context (the model understands it is processing a person with an email and balance) without sending actual PII to the provider.
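A minimal sketch of the detect → tokenise → restore round trip. Regex detectors stand in here for a real NER analyser (such as Presidio), person names are assumed to be supplied by the caller, and the function and pattern names are illustrative:

```python
import re

# Toy detectors for the worked example above; a real pipeline would use
# an NER engine for names and many more entity types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "AMOUNT": re.compile(r"\$[\d,]+(?:\.\d{2})?"),
}

def pseudonymise(text, known_names=()):
    """Replace entities with consistent tokens; return masked text + mapping."""
    mapping = {}    # token -> original value (kept client-side, never sent)
    seen = {}       # original value -> token, so repeats get the same token
    counters = {}

    def token_for(kind, value):
        if value not in seen:
            counters[kind] = counters.get(kind, 0) + 1
            token = f"TOKEN_{kind}_{counters[kind]:03d}"
            seen[value] = token
            mapping[token] = value
        return seen[value]

    for name in known_names:  # names need NER in practice (assumption here)
        text = text.replace(name, token_for("PERSON", name))
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, k=kind: token_for(k, m.group()), text)
    return text, mapping

def restore(model_output, mapping):
    """Swap tokens back to original values before delivering the response."""
    for token, value in mapping.items():
        model_output = model_output.replace(token, value)
    return model_output
```

The mapping lives only on the calling side, so the LLM provider sees tokens; the same value always maps to the same token, which is what keeps multi-turn context coherent.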

GDPR Obligations for AI Systems

Core GDPR principles

  • Lawful basis: you must have a legal basis (consent, contract, legitimate interest) for processing PII through an AI system
  • Data minimisation: do not send more PII to the model than is necessary for the task
  • Purpose limitation: PII collected for one purpose cannot be used to train models for a different purpose without new consent
  • Right to erasure: a user's request to be forgotten applies to AI logs and vector store embeddings

Hard requirements

  • Data Processing Agreement signed with every AI provider that processes PII
  • Record of processing activities updated to include AI systems
  • Data Protection Impact Assessment for high-risk AI processing
  • Never log raw user messages containing PII without pseudonymisation
  • PII in vector stores: must be deletable on erasure request (re-embed or delete affected chunks)
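The vector-store erasure requirement can be sketched as a subject-to-chunk index maintained at ingestion time. This in-memory class and its method names are hypothetical; production stores (pgvector, Pinecone, Qdrant and similar) expose delete-by-id or metadata filters that serve the same purpose:

```python
from collections import defaultdict

class ErasableVectorStore:
    """Sketch: track which data subjects each chunk mentions so an
    erasure request can delete every affected embedding."""

    def __init__(self):
        self.chunks = {}                    # chunk_id -> (embedding, text, subjects)
        self.by_subject = defaultdict(set)  # subject_id -> chunk ids

    def add(self, chunk_id, embedding, text, subject_ids):
        self.chunks[chunk_id] = (embedding, text, set(subject_ids))
        for s in subject_ids:
            self.by_subject[s].add(chunk_id)

    def erase_subject(self, subject_id):
        """Handle a right-to-erasure request; returns the deleted chunk ids."""
        removed = self.by_subject.pop(subject_id, set())
        for cid in removed:
            _, _, subjects = self.chunks.pop(cid)
            for other in subjects - {subject_id}:
                self.by_subject[other].discard(cid)
        return sorted(removed)
```

The key design point is recording subject ids as chunk metadata at ingestion; without that index, honouring an erasure request means re-scanning the entire corpus for mentions of the individual.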

Checklist: Do You Understand This?

  • What is implicit PII — and why do automated redaction tools miss it?
  • Name five places PII can enter an AI pipeline beyond the obvious user prompt.
  • Explain the pseudonymisation pattern — what problem does it solve that simple redaction does not?
  • What GDPR principle prevents you from using customer support conversation data to fine-tune a marketing model?
  • How do you handle a GDPR right-to-erasure request for a user whose data was used to create embeddings in a vector store?
  • Which PII detection tool would you choose for a self-hosted on-premise deployment in a regulated healthcare environment?