Data Retention for AI Systems
AI systems generate more data types than traditional applications: conversation logs, prompt history, model outputs, user feedback, fine-tuning datasets, audit traces, and vector embeddings. Each data type may have different regulatory retention requirements, different deletion complexities, and different business needs. A retention policy that covers only databases while ignoring AI-specific data types is incomplete.
AI-Specific Data Types Requiring a Retention Policy
| Data type | Why it needs a policy | Deletion complexity |
|---|---|---|
| Conversation logs | Contains user PII; GDPR applies; required for incident investigation | Low — delete rows from log store; redact PII at creation |
| Prompt and output history | PII in prompts; regulatory obligation for financial/medical AI | Low if pseudonymised at creation |
| Vector embeddings | Derived from documents that may contain PII; partial semantic reconstruction possible | Medium — requires re-embedding or chunk-level deletion |
| Fine-tuning datasets | Often derived from user interactions; PII inherited from source data | High — requires data lineage tracking to find which records contain specific user data |
| Audit traces | Required by regulation; must be retained for defined period; cannot be deleted early | N/A — must be retained; PII redacted from user-identifying fields |
| User feedback and annotations | Contains user judgements; PII in feedback text; used for model improvement | Low — structured records with user_id foreign key |
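The table above can be encoded as a machine-readable policy so that every pipeline consults one source of truth instead of hard-coding retention logic. The sketch below is illustrative: the type names, periods, and `DeletionStrategy` values are assumptions for this example, not a standard schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DeletionStrategy(Enum):
    ROW_DELETE = "row_delete"        # delete records keyed by user_id
    REDACT_PII = "redact_pii"        # keep the record, null out PII fields
    RE_EMBED = "re_embed"            # delete chunks, re-embed remaining corpus
    LINEAGE_TRACE = "lineage_trace"  # trace training examples back to source users


@dataclass(frozen=True)
class RetentionRule:
    data_type: str
    retention_days: Optional[int]    # None = regulatory hold, no automatic expiry
    strategy: DeletionStrategy


# Illustrative policy table; the periods are placeholders, not legal advice.
POLICY = {
    "conversation_logs": RetentionRule("conversation_logs", 365, DeletionStrategy.ROW_DELETE),
    "vector_embeddings": RetentionRule("vector_embeddings", 730, DeletionStrategy.RE_EMBED),
    "fine_tuning_data": RetentionRule("fine_tuning_data", 730, DeletionStrategy.LINEAGE_TRACE),
    "audit_traces": RetentionRule("audit_traces", None, DeletionStrategy.REDACT_PII),
}
```

A deletion pipeline can then branch on `POLICY[data_type].strategy` rather than embedding per-store assumptions in each job.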
Retention Periods by Regulation
| Regulation | Retention requirement | Applies to AI when |
|---|---|---|
| GDPR | Purpose-limited: retain only as long as necessary for the stated purpose; erasure on request | Processing EU personal data in any AI component |
| HIPAA | Required documentation: 6 years from creation or the date last in effect (medical-record retention periods themselves are set by state law) | AI systems processing Protected Health Information (PHI) |
| EU AI Act (Articles 12 and 19) | High-risk AI: automatically generated logs kept for a period appropriate to the system's intended purpose, at least 6 months, unless Union or national law requires longer | High-risk AI systems under EU AI Act classification |
| Financial services | Typically 5-7 years for records of financial decisions (jurisdiction-dependent) | AI systems involved in credit, trading, or financial advice |
| Employment law | Varies; typically 3-7 years for HR decisions; retain if AI influenced hiring/firing | AI used in recruitment, performance review, or HR decisions |
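When several regimes apply to the same record, the legally mandated retention floors compose as a maximum: you must keep the record for the longest mandated period, and GDPR's storage-limitation principle then argues for deleting it once that period has passed. A minimal sketch (the regime names and periods below are hypothetical):

```python
def effective_retention_years(mandated_minimums: dict) -> float:
    """Given retention floors imposed by each applicable regime (in years),
    return the binding minimum: the longest one. Under GDPR storage
    limitation, data should be deleted once this period has elapsed,
    absent another lawful basis for keeping it."""
    return max(mandated_minimums.values())


# Hypothetical credit-scoring system processing EU personal data:
floors = {"financial_services": 7, "eu_ai_act_logs": 0.5}
effective_retention_years(floors)  # -> 7
```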
Tiered Retention Architecture
Hot tier (0-30 days)
- Active conversation sessions
- Recent audit logs for incident response
- Fast-access query store (PostgreSQL, DynamoDB)
Warm tier (30 days–2 years)
- Audit traces for regulatory review
- Analytical query access
- Compressed object storage (S3 Standard-IA)
Cold tier (2+ years)
- Regulatory hold data
- Rare access; legal / compliance retrieval
- Glacier / Archive class storage
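On AWS, the hot-to-warm-to-cold progression above maps naturally onto an S3 lifecycle rule. The dict below follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the prefix and day counts are illustrative assumptions, to be tuned to your actual regulatory obligations.

```python
# Lifecycle rule mirroring the hot/warm/cold tiers: Standard-IA at 30 days,
# Glacier at 2 years, deletion after ~7 years.
lifecycle_config = {
    "Rules": [
        {
            "ID": "ai-audit-log-tiering",          # hypothetical rule name
            "Filter": {"Prefix": "audit-logs/"},   # hypothetical key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # hot -> warm
                {"Days": 730, "StorageClass": "GLACIER"},     # warm -> cold
            ],
            "Expiration": {"Days": 2555},  # ~7 years, then auto-delete
        }
    ]
}
```

Note that lifecycle expiration only satisfies a retention policy if the hot tier (e.g. PostgreSQL or DynamoDB) enforces its own TTL separately; S3 rules govern only the object-storage copies.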
Right to Erasure in AI Systems
Handling GDPR right-to-erasure requests is harder for AI systems than for traditional databases. Each data type requires a separate deletion procedure.
| Data type | Erasure procedure |
|---|---|
| Conversation logs | Delete rows by user_id; redact PII fields if full deletion conflicts with audit retention |
| Vector embeddings | Delete chunks whose source documents contain the user's data; re-embed remaining corpus |
| Fine-tuning datasets | Requires data lineage: trace which training examples contain the user's data; remove and retrain if necessary |
| Audit traces | Cannot delete (required by regulation); redact PII fields while preserving the event record |
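Because each data type has its own procedure, it helps to generate an explicit per-store erasure plan for every request, so the steps can be reviewed and their completion verified. A minimal sketch, returning action descriptions rather than executing them (the store names and action strings are illustrative):

```python
def erasure_plan(user_id: str) -> list:
    """Build one erasure action per AI data store for a GDPR request.
    Audit traces are redacted rather than deleted, reflecting the
    conflict between erasure rights and audit-retention obligations."""
    return [
        ("conversation_logs", f"DELETE rows WHERE user_id = {user_id!r}"),
        ("vector_embeddings", f"delete chunks sourced from {user_id!r}; re-embed remaining corpus"),
        ("fine_tuning_data", f"trace lineage for {user_id!r}; remove examples, retrain if needed"),
        ("audit_traces", f"REDACT PII fields for {user_id!r}; keep event records"),
    ]
```

Executing the plan and recording each step's outcome also produces the evidence trail many regulators expect for erasure-request handling.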
Automated Deletion Pipelines
- TTL on conversation logs in the database — auto-delete after N days without manual intervention
- Scheduled vector store cleanup — identify and delete embeddings whose source documents have exceeded retention
- Fine-tuning data quarantine — tag records by user_id at creation so erasure requests can be processed without manual data lineage tracing
- Deletion verification — confirm that deleted data does not reappear in backups or replicas
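The TTL bullet above can be implemented without any scheduler in stores with native expiry, such as DynamoDB, by stamping each record with an epoch-seconds expiry attribute at write time. A minimal sketch, assuming a 30-day policy and a hypothetical item shape:

```python
import time
from typing import Optional

TTL_DAYS = 30  # illustrative retention period for conversation logs


def log_item(user_id: str, text: str, now: Optional[float] = None) -> dict:
    """Conversation-log record with a DynamoDB-style TTL attribute:
    an epoch-seconds timestamp after which the store auto-deletes
    the item, with no manual cleanup job required."""
    created = int(now if now is not None else time.time())
    return {
        "user_id": user_id,
        "text": text,
        "created_at": created,
        "expires_at": created + TTL_DAYS * 86400,  # the TTL attribute
    }
```

Note that native TTL deletes only the primary copy; the deletion-verification step still has to confirm that backups, replicas, and downstream analytical copies honour the same expiry.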
Checklist: Do You Understand This?
- Why is deleting user data from a fine-tuning dataset harder than deleting it from a conversation log?
- What is the GDPR retention principle for AI conversation logs — and how does it conflict with the need to retain audit traces?
- What retention period applies to an AI system that assisted in a UK financial advisor's client recommendations?
- Design an erasure procedure for a user whose conversations were used to fine-tune an internal model.
- Why does the EU AI Act (Article 12) create a retention obligation rather than a deletion obligation?
- What is the purpose of tagging fine-tuning records with user_id at creation time?