Advanced

Vendor Risk Management

An enterprise AI programme typically depends on 8-15 vendors: a frontier model provider, a managed AI platform, a vector database, an observability tool, guardrails, and several others. Each vendor is a dependency risk — a security incident, pricing change, model deprecation, or acquisition at any one of them can affect your production AI systems. Systematic vendor risk management is not bureaucracy; it is the difference between a managed disruption and a production emergency.

AI Vendor Categories and Risk Levels

Category	Examples	Risk level	Key risk
Frontier model API	OpenAI, Anthropic, Google	Critical	All AI capability depends on availability and DPA terms
Managed AI platform	Amazon Bedrock, Azure AI Foundry, Google Cloud Vertex AI	Critical	Data processing; multi-model access; vendor lock-in risk
Open-weight model	Meta Llama, Mistral, Qwen	High	No vendor SLA; licence terms; you own security and compliance
Vector database	Pinecone, Weaviate, Qdrant	High	Data residency; retention; availability SLA
Observability / tracing	Langfuse, LangSmith, Datadog LLM	Medium	Prompt content in logs; data processing terms
Guardrails / safety	Lakera Guard, Azure AI Content Safety	Medium	Latency dependency; accuracy SLA
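
The categories above can be kept as a small machine-readable vendor register, so that reviews start with the most critical dependencies. The following is a minimal sketch, not a prescribed schema: the `Vendor` and `RiskLevel` types, the helper functions, and the vendor names are all hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import IntEnum


class RiskLevel(IntEnum):
    """Risk levels from the table, ordered so that higher means more critical."""
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Vendor:
    name: str
    category: str
    risk: RiskLevel
    key_risk: str


# Illustrative entries only; the names are placeholders, not real vendors.
REGISTER = [
    Vendor("example-tracing", "Observability / tracing", RiskLevel.MEDIUM,
           "Prompt content in logs"),
    Vendor("example-model-api", "Frontier model API", RiskLevel.CRITICAL,
           "Availability and DPA terms"),
    Vendor("example-vector-db", "Vector database", RiskLevel.HIGH,
           "Data residency and retention"),
]


def by_priority(vendors):
    """Return vendors sorted most critical first, then alphabetically by name."""
    return sorted(vendors, key=lambda v: (-v.risk, v.name))


def at_or_above(vendors, level):
    """Return vendors whose risk level is at least `level`."""
    return [v for v in vendors if v.risk >= level]


if __name__ == "__main__":
    for vendor in by_priority(REGISTER):
        print(f"{vendor.risk.name:<8} {vendor.name}: {vendor.key_risk}")
```

Keeping the register in version control gives each change to a vendor's risk level a reviewable history.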

Due Diligence Checklist

Complete this checklist before onboarding any vendor that processes your data or contributes to your production AI stack.

Security certifications

SOC 2 Type II (preferred over Type I: Type I assesses control design at a point in time, while Type II tests whether controls operated effectively over a period)
ISO 27001 for international requirements
HIPAA BAA if processing PHI
FedRAMP if US government data
Most recent pen test report (less than 12 months old)

Data handling

Data Processing Agreement (DPA) signed before any data is shared
Subprocessor list reviewed and approved
Data residency options meet your regulatory requirements
Training on customer data: opt-out confirmed in writing
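Incident notification SLA (GDPR gives you, as controller, 72 hours to notify the regulator, so the vendor's notice to you must arrive well inside that window)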
Data deletion on contract termination: confirmed timeline
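
One way to make the checklist enforceable is to record each vendor's answers and compute the items that still block onboarding. The sketch below covers a subset of the checklist. The `DueDiligence` record, its field names, and `open_items` are hypothetical, and "less than 12 months" is approximated as 365 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class DueDiligence:
    """Hypothetical record of one vendor's checklist answers."""
    soc2_type2: bool = False
    dpa_signed: bool = False
    subprocessors_reviewed: bool = False
    training_opt_out_in_writing: bool = False
    deletion_timeline_confirmed: bool = False
    last_pen_test: Optional[date] = None
    processes_phi: bool = False
    hipaa_baa: bool = False


def open_items(dd: DueDiligence, today: date) -> List[str]:
    """List the checklist items that still block onboarding."""
    items = []
    if not dd.soc2_type2:
        items.append("SOC 2 Type II report missing")
    if not dd.dpa_signed:
        items.append("DPA not signed")
    if not dd.subprocessors_reviewed:
        items.append("Subprocessor list not reviewed")
    if not dd.training_opt_out_in_writing:
        items.append("Training opt-out not confirmed in writing")
    if not dd.deletion_timeline_confirmed:
        items.append("Deletion timeline not confirmed")
    # Assumption: "less than 12 months" is approximated as 365 days.
    if dd.last_pen_test is None or today - dd.last_pen_test > timedelta(days=365):
        items.append("Pen test report missing or older than 12 months")
    # A BAA is only required when the vendor processes PHI.
    if dd.processes_phi and not dd.hipaa_baa:
        items.append("HIPAA BAA missing")
    return items
```

An empty list from `open_items` means no blocking item remains among those modelled here; ISO 27001 and FedRAMP checks could be added in the same way.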

Data Processing Agreement Key Clauses

Clause	What to look for
Training data use	Explicit opt-out from using your data to train or improve models; confirm this is the enterprise tier default
Data retention	How long prompts and responses are retained; confirm it matches your data minimisation requirements
Deletion on request	Timeline for deleting your data on request; should let you meet GDPR erasure obligations (you normally have one month to respond to a request)
Breach notification	Notification SLA short enough for you to meet GDPR's 72-hour deadline for notifying the supervisory authority; confirm the notification goes to your legal team or DPO, not just a generic email address
Subprocessors	List of subprocessors must be available; you must be notified before new subprocessors are added

Vendor Concentration Risk

Over-reliance on a single provider is a reliability risk

Relying on a single frontier model provider for all AI capability means that provider's outages, price changes, model deprecations, or DPA changes directly impact your production systems. Mitigations:

Provider abstraction layer (LiteLLM, AWS Bedrock unified endpoint) — swap providers without application code changes
Multi-provider failover for critical use cases
Open-weight model as fallback for non-sensitive use cases
Document the migration path to a different primary provider before you need it
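
The abstraction and failover mitigations can be illustrated with a minimal sketch in plain Python. It assumes each provider is wrapped in a function that takes a prompt and returns text. The function names, the stub providers, and the `AllProvidersFailed` error are hypothetical and do not represent the API of LiteLLM, Bedrock, or any other product.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is a (name, call) pair; `call` takes a prompt and returns text.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider has failed."""


def complete_with_failover(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order and return (provider_name, response) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers standing in for real SDK wrappers (hypothetical).
def primary(prompt: str) -> str:
    raise TimeoutError("primary provider timed out")


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


if __name__ == "__main__":
    used, text = complete_with_failover("hello", [("primary", primary), ("secondary", secondary)])
    print(used, text)
```

Because application code only sees `complete_with_failover`, changing the primary provider becomes a change to the provider list rather than to every call site. Prompts and outputs still need re-validation on the new model.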

Ongoing Vendor Monitoring

Subscribe to vendor security bulletins and status pages
Track DPA change notifications — providers update DPAs and notify by email; document reviews and approvals
Monitor model deprecation announcements and start migration planning as soon as one is published; providers typically give 3-6 months' notice
Track pricing changes — AI pricing moves rapidly; cost models built on today's prices need quarterly review
Annual vendor security reassessment — request updated SOC 2 reports and confirm DPA still matches your data processing
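
Two of these monitoring tasks are date-driven and easy to automate: model retirement dates and the annual reassessment. The sketch below assumes you maintain both sets of dates by hand from vendor announcements. The function name, the 90-day warning threshold, and the sample data are illustrative assumptions.

```python
from datetime import date
from typing import Dict, List


def monitoring_alerts(
    today: date,
    retirements: Dict[str, date],
    last_assessed: Dict[str, date],
    notice_days: int = 90,
    reassess_days: int = 365,
) -> List[str]:
    """Flag models retiring within `notice_days` and vendors not reassessed in `reassess_days`."""
    alerts = []
    for model, retire_on in sorted(retirements.items()):
        remaining = (retire_on - today).days
        if remaining <= notice_days:
            alerts.append(f"{model}: retires in {remaining} days")
    for vendor, assessed_on in sorted(last_assessed.items()):
        if (today - assessed_on).days > reassess_days:
            alerts.append(f"{vendor}: security reassessment overdue")
    return alerts


if __name__ == "__main__":
    # Sample data with placeholder names, for illustration only.
    alerts = monitoring_alerts(
        today=date(2025, 1, 1),
        retirements={"model-a": date(2025, 2, 1), "model-b": date(2025, 12, 1)},
        last_assessed={"vendor-x": date(2023, 6, 1), "vendor-y": date(2024, 9, 1)},
    )
    for line in alerts:
        print(line)
```

Run on a schedule, a check like this turns deprecation notices and reassessment deadlines into tickets instead of relying on someone remembering them.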

Checklist: Do You Understand This?

What makes a frontier model API a "Critical" risk category vendor — and what does that imply for due diligence?
What is the difference between SOC 2 Type I and Type II — and which should you require?
What five clauses in a Data Processing Agreement are most important for an AI vendor relationship?
What is vendor concentration risk — and what architectural mitigations reduce it?
Why do you need to monitor DPA change notifications on an ongoing basis, not just at onboarding?
What risks does using an open-weight model (like Llama) carry compared with a managed frontier model API?