Data Retention for AI Systems
AI systems generate more data types than traditional applications: conversation logs, prompt history, model outputs, user feedback, fine-tuning datasets, audit traces, and vector embeddings. Each data type may have different regulatory retention requirements, different deletion complexities, and different business needs. A retention policy that covers only databases while ignoring AI-specific data types is incomplete.
AI-Specific Data Types Requiring a Retention Policy
| Data type | Why it needs a policy | Deletion complexity |
|---|---|---|
| Conversation logs | Contains user PII; GDPR applies; required for incident investigation | Low — delete rows from log store; redact PII at creation |
| Prompt and output history | PII in prompts; regulatory obligation for financial/medical AI | Low if pseudonymised at creation |
| Vector embeddings | Derived from documents that may contain PII; partial semantic reconstruction possible | Medium — requires re-embedding or chunk-level deletion |
| Fine-tuning datasets | Often derived from user interactions; PII inherited from source data | High — requires data lineage tracking to find which records contain specific user data |
| Audit traces | Required by regulation; must be retained for defined period; cannot be deleted early | N/A — must be retained; PII redacted from user-identifying fields |
| User feedback and annotations | Contains user judgements; PII in feedback text; used for model improvement | Low — structured records with user_id foreign key |
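The table above can be encoded as a machine-readable policy so that every pipeline consults one source of truth instead of hard-coding retention logic. The sketch below is illustrative: the type names, periods, and `DeletionStrategy` values are assumptions for this example, not a standard schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DeletionStrategy(Enum):
    ROW_DELETE = "row_delete"        # delete records keyed by user_id
    REDACT_PII = "redact_pii"        # keep the record, null out PII fields
    RE_EMBED = "re_embed"            # delete chunks, re-embed remaining corpus
    LINEAGE_TRACE = "lineage_trace"  # trace training examples back to source users


@dataclass(frozen=True)
class RetentionRule:
    data_type: str
    retention_days: Optional[int]    # None = regulatory hold, no automatic expiry
    strategy: DeletionStrategy


# Illustrative policy table; the periods are placeholders, not legal advice.
POLICY = {
    "conversation_logs": RetentionRule("conversation_logs", 365, DeletionStrategy.ROW_DELETE),
    "vector_embeddings": RetentionRule("vector_embeddings", 730, DeletionStrategy.RE_EMBED),
    "fine_tuning_data": RetentionRule("fine_tuning_data", 730, DeletionStrategy.LINEAGE_TRACE),
    "audit_traces": RetentionRule("audit_traces", None, DeletionStrategy.REDACT_PII),
}
```

A deletion pipeline can then branch on `POLICY[data_type].strategy` rather than embedding per-store assumptions in each job.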
Retention Periods by Regulation
| Regulation | Retention requirement | Applies to AI when |
|---|---|---|
| GDPR | Purpose-limited: retain only as long as necessary for the stated purpose; erasure on request | Processing EU personal data in any AI component |
| HIPAA | Required documentation: 6 years from creation or the date last in effect (medical-record retention periods themselves are set by state law) | AI systems processing Protected Health Information (PHI) |
| EU AI Act (Articles 12 and 19) | High-risk AI: automatically generated logs kept for a period appropriate to the system's intended purpose, at least 6 months, unless Union or national law requires longer | High-risk AI systems under EU AI Act classification |
| Financial services | Typically 5-7 years for records of financial decisions (jurisdiction-dependent) | AI systems involved in credit, trading, or financial advice |
| Employment law | Varies; typically 3-7 years for HR decisions; retain if AI influenced hiring/firing | AI used in recruitment, performance review, or HR decisions |
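When several regimes apply to the same record, the legally mandated retention floors compose as a maximum: you must keep the record for the longest mandated period, and GDPR's storage-limitation principle then argues for deleting it once that period has passed. A minimal sketch (the regime names and periods below are hypothetical):

```python
def effective_retention_years(mandated_minimums: dict) -> float:
    """Given retention floors imposed by each applicable regime (in years),
    return the binding minimum: the longest one. Under GDPR storage
    limitation, data should be deleted once this period has elapsed,
    absent another lawful basis for keeping it."""
    return max(mandated_minimums.values())


# Hypothetical credit-scoring system processing EU personal data:
floors = {"financial_services": 7, "eu_ai_act_logs": 0.5}
effective_retention_years(floors)  # -> 7
```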
Tiered Retention Architecture
Hot tier (0-30 days)
- Active conversation sessions
- Recent audit logs for incident response
- Fast-access query store (PostgreSQL, DynamoDB)
Warm tier (30 days–2 years)
- Audit traces for regulatory review
- Analytical query access
- Compressed object storage (S3 Standard-IA)
Cold tier (2+ years)
- Regulatory hold data
- Rare access; legal / compliance retrieval
- Glacier / Archive class storage
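On AWS, the hot-to-warm-to-cold progression above maps naturally onto an S3 lifecycle rule. The dict below follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the prefix and day counts are illustrative assumptions, to be tuned to your actual regulatory obligations.

```python
# Lifecycle rule mirroring the hot/warm/cold tiers: Standard-IA at 30 days,
# Glacier at 2 years, deletion after ~7 years.
lifecycle_config = {
    "Rules": [
        {
            "ID": "ai-audit-log-tiering",          # hypothetical rule name
            "Filter": {"Prefix": "audit-logs/"},   # hypothetical key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # hot -> warm
                {"Days": 730, "StorageClass": "GLACIER"},     # warm -> cold
            ],
            "Expiration": {"Days": 2555},  # ~7 years, then auto-delete
        }
    ]
}
```

Note that lifecycle expiration only satisfies a retention policy if the hot tier (e.g. PostgreSQL or DynamoDB) enforces its own TTL separately; S3 rules govern only the object-storage copies.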
Right to Erasure in AI Systems
Handling GDPR right-to-erasure requests is harder for AI systems than for traditional databases. Each data type requires a separate deletion procedure.
| Data type | Erasure procedure |
|---|---|
| Conversation logs | Delete rows by user_id; redact PII fields if full deletion conflicts with audit retention |
| Vector embeddings | Delete chunks whose source documents contain the user's data; re-embed remaining corpus |
| Fine-tuning datasets | Requires data lineage: trace which training examples contain the user's data; remove and retrain if necessary |
| Audit traces | Cannot delete (required by regulation); redact PII fields while preserving the event record |
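Because each data type has its own procedure, it helps to generate an explicit per-store erasure plan for every request, so the steps can be reviewed and their completion verified. A minimal sketch, returning action descriptions rather than executing them (the store names and action strings are illustrative):

```python
def erasure_plan(user_id: str) -> list:
    """Build one erasure action per AI data store for a GDPR request.
    Audit traces are redacted rather than deleted, reflecting the
    conflict between erasure rights and audit-retention obligations."""
    return [
        ("conversation_logs", f"DELETE rows WHERE user_id = {user_id!r}"),
        ("vector_embeddings", f"delete chunks sourced from {user_id!r}; re-embed remaining corpus"),
        ("fine_tuning_data", f"trace lineage for {user_id!r}; remove examples, retrain if needed"),
        ("audit_traces", f"REDACT PII fields for {user_id!r}; keep event records"),
    ]
```

Executing the plan and recording each step's outcome also produces the evidence trail many regulators expect for erasure-request handling.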
Automated Deletion Pipelines
- TTL on conversation logs in the database — auto-delete after N days without manual intervention
- Scheduled vector store cleanup — identify and delete embeddings whose source documents have exceeded retention
- Fine-tuning data quarantine — tag records by user_id at creation so erasure requests can be processed without manual data lineage tracing
- Deletion verification — confirm that deleted data does not reappear in backups or replicas
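The TTL bullet above can be implemented without any scheduler in stores with native expiry, such as DynamoDB, by stamping each record with an epoch-seconds expiry attribute at write time. A minimal sketch, assuming a 30-day policy and a hypothetical item shape:

```python
import time
from typing import Optional

TTL_DAYS = 30  # illustrative retention period for conversation logs


def log_item(user_id: str, text: str, now: Optional[float] = None) -> dict:
    """Conversation-log record with a DynamoDB-style TTL attribute:
    an epoch-seconds timestamp after which the store auto-deletes
    the item, with no manual cleanup job required."""
    created = int(now if now is not None else time.time())
    return {
        "user_id": user_id,
        "text": text,
        "created_at": created,
        "expires_at": created + TTL_DAYS * 86400,  # the TTL attribute
    }
```

Note that native TTL deletes only the primary copy; the deletion-verification step still has to confirm that backups, replicas, and downstream analytical copies honour the same expiry.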
Checklist: Do You Understand This?
- Why is deleting user data from a fine-tuning dataset harder than deleting it from a conversation log?
- What is the GDPR retention principle for AI conversation logs — and how does it conflict with the need to retain audit traces?
- What retention period applies to an AI system that assisted in a UK financial advisor's client recommendations?
- Design an erasure procedure for a user whose conversations were used to fine-tune an internal model.
- Why does the EU AI Act (Article 12) create a retention obligation rather than a deletion obligation?
- What is the purpose of tagging fine-tuning records with user_id at creation time?