SLOs for AI Systems
Service Level Objectives (SLOs) for AI systems are harder to define than for traditional APIs. Latency is non-deterministic. Quality is not binary — a response can be partly correct. Hallucination is a quality failure with no error code. And models change under you: a provider update can silently shift behaviour without changing the API. Despite these challenges, SLOs are essential for holding AI systems to measurable standards.
Why AI SLOs Are Different
Traditional SLO assumptions that break
- Output quality is binary (correct or error code) — AI outputs exist on a spectrum
- Latency is predictable from input size — LLM latency varies with model load and output length
- System behaviour is stable between deployments — model provider updates change behaviour without version bumps
- Error rate captures all failures — quality failures return HTTP 200
AI-specific SLO requirements
- Separate latency SLOs for TTFT and total response time
- Quality SLOs measured through sampling, not error codes
- Cost SLOs — staying within budget is an operational objective
- Task success rate SLO for agentic systems — completion, not just response
- More frequent SLO reviews (monthly, not quarterly) due to model change risk
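These requirements can be made concrete as a small declarative SLO record. This is a sketch, not a prescribed schema: the field names and the example targets are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slo:
    """One service level objective: a named metric, a target, and a window."""
    name: str                       # metric identifier, e.g. "ttft_p95_seconds"
    target: float                   # ratio or threshold the metric must meet
    window_days: int                # evaluation window for the SLO
    review_cadence_days: int = 30   # monthly review, per the guidance above

# Illustrative set for a chat-style AI system (values are examples, not norms).
EXAMPLE_SLOS = [
    Slo("availability_usable_response", target=0.995, window_days=30),
    Slo("ttft_p95_seconds", target=2.0, window_days=30),
    Slo("total_latency_p95_seconds", target=10.0, window_days=30),
    Slo("hallucination_rate_max", target=0.03, window_days=30),
    Slo("cost_per_request_p95_usd", target=0.05, window_days=30),
]
```

Keeping SLOs as data rather than prose makes the monthly review a diff of one file instead of a document hunt.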
Traditional SLO Dimensions Adapted for AI
| Dimension | Traditional definition | AI adaptation |
|---|---|---|
| Availability | Percentage of requests that return non-5xx response | Same, plus: percentage of requests where the model returns a usable response (not a refusal or format error) |
| Latency | P50/P95/P99 response time | Track separately: TTFT P50/P95 (streaming start) and total latency P50/P95 (response complete) |
| Error rate | Percentage of requests returning 4xx/5xx | HTTP error rate + format error rate (JSON parse failures, schema violations) + refusal rate (model declined to answer) |
| Cost | N/A (traditional services have fixed infrastructure cost) | Cost per request P95; daily spend vs budget; cost per successful task completion |
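Measuring TTFT and total latency separately, as the table recommends, only requires timing the stream in two places. A minimal client-agnostic sketch, assuming the response is any iterable of text chunks (the `fake_stream` generator stands in for a real streaming LLM client):

```python
import time

def measure_streaming_latency(stream):
    """Consume a token stream, recording time-to-first-token (TTFT)
    and total latency as two separate measurements."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        chunks.append(chunk)
    total = time.monotonic() - start  # response complete
    return "".join(chunks), ttft, total

def fake_stream():
    """Stand-in for a real streaming client, for illustration only."""
    for token in ["Hello", ", ", "world"]:
        yield token

text, ttft, total = measure_streaming_latency(fake_stream())
```

Feed `ttft` and `total` into separate histograms so the P95s in the table can be computed independently; a single "response time" histogram cannot recover TTFT after the fact.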
AI-Specific SLO Dimensions
| Dimension | Definition | How to measure |
|---|---|---|
| Task success rate | Percentage of agentic tasks completed to a defined outcome | Instrument agent run end states: success / partial / failure / abandoned |
| Hallucination rate | Percentage of responses containing factual errors (for RAG/factual systems) | Automated: citation grounding check (is the answer supported by retrieved documents?); manual: sampled review |
| Guardrail trigger rate | Percentage of requests blocked by policy — a signal of system health | Count blocked/modified requests from guardrail middleware; alert on sudden changes |
| User satisfaction proxy | Aggregate of thumbs-down rate, escalation rate, restart rate | Composite score from feedback signals; a negative-feedback rate below 5% is a common starting target |
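The automated citation-grounding check for hallucination rate can start very simply. The sketch below uses naive lexical overlap (is every content word of each answer sentence present in the retrieved documents?); a production system would use NLI or an LLM judge, but this illustrates the sampling signal the table describes.

```python
import string

def _words(text: str) -> set[str]:
    """Lowercased content words (> 3 chars) with punctuation stripped."""
    table = str.maketrans("", "", string.punctuation)
    return {w for w in text.translate(table).lower().split() if len(w) > 3}

def grounding_score(answer: str, retrieved_docs: list[str]) -> float:
    """Fraction of answer sentences fully grounded in the retrieved docs.

    Naive lexical check for illustration; real grounding checks should
    use entailment models or an LLM judge.
    """
    vocab = _words(" ".join(retrieved_docs))
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 1.0
    grounded = sum(1 for s in sentences if _words(s) <= vocab)
    return grounded / len(sentences)

score = grounding_score(
    "The Eiffel Tower is in Paris. It is located in Berlin.",
    ["The Eiffel Tower is in Paris and is 330 metres tall."],
)  # one of the two sentences is grounded -> 0.5
```

Run this on a random sample of responses and report the share scoring below a chosen threshold as the estimated hallucination rate.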
AI SLOs span these four quality-oriented dimensions in addition to the adapted traditional ones; none of the four has a direct equivalent for a traditional service
SLO Templates by System Type
RAG Chatbot SLOs
| SLO | Target | Error budget (30-day) |
|---|---|---|
| Availability (usable response returned) | 99.5% | 3.6 hours of full outage |
| TTFT P95 (first streaming token) | < 2 seconds | 5% of requests may exceed |
| Total latency P95 (full response) | < 10 seconds | 5% of requests may exceed |
| Hallucination rate (citation grounding) | < 3% | Sampled; alert if > 3% in any 24h window |
| Negative feedback rate | < 5% | Rolling 7-day average |
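The error-budget column follows directly from the target and the window: a 99.5% availability target over a 720-hour month leaves 0.5%, or 3.6 hours, of full outage. A small helper makes the conversion explicit:

```python
def error_budget(target: float, window_days: int = 30):
    """Translate an availability SLO target into its error budget:
    the allowed failure fraction and the equivalent full-outage hours."""
    budget_fraction = 1.0 - target
    window_hours = window_days * 24
    return budget_fraction, budget_fraction * window_hours

frac, hours = error_budget(0.995)
# 0.5% of a 720-hour month = 3.6 hours of full outage, matching the table
```

The same arithmetic applies to partial degradation: 7.2 hours at 50% failure rate spends the same budget as 3.6 hours fully down.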
AI Agent (Document Processing) SLOs
| SLO | Target | Error budget (30-day) |
|---|---|---|
| Task success rate (document fully processed) | 98% | 2% of documents may require manual review |
| End-to-end processing time P95 | < 120 seconds per document | 5% of documents may take longer |
| Structured output format compliance | 99.5% | 0.5% schema parse failures acceptable |
| Cost per document P95 | < $0.05 | Cost SLO; alert when P95 exceeds target for 24h |
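The cost SLO in this table needs a P95 over per-document costs, not an average, so that a few expensive documents cannot hide behind many cheap ones. A sketch using the nearest-rank percentile (the alert wiring for the 24-hour sustained-breach condition is assumed, not shown):

```python
import math

def p95(values: list[float]) -> float:
    """Nearest-rank P95: the smallest value >= 95% of the sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]

def cost_slo_breached(costs_usd: list[float], target: float = 0.05) -> bool:
    """True when P95 cost per document exceeds the target; in production
    this feeds an alert that fires only after 24h of sustained breach."""
    return p95(costs_usd) > target
```

For example, 5 documents costing $0.20 among 95 costing $0.01 do not breach the P95 target, but 10 of them do; a mean-based check would miss that shift entirely.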
Error Budget for AI Quality Failures
Quality failures burn error budget just like HTTP errors — measure and act on both
Quality failures consume error budget just as availability failures do. Count flagged quality events (hallucinations, format errors, negative feedback above threshold) as error events for the purpose of error budget calculation.
- Each sampled hallucination = one error event against the quality SLO error budget
- Each format parse failure = one error event against the availability/format SLO
- Each user escalation from AI to human = one error event against task success SLO
- Track error budget burn rate daily; freeze non-critical changes when budget < 20% remaining
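The budget-remaining and change-freeze rules above reduce to a few lines once quality events are counted alongside HTTP errors. A sketch, with the 20% freeze threshold from the list as the default:

```python
def budget_remaining(error_events: int, total_events: int,
                     slo_target: float) -> float:
    """Fraction of the error budget still unspent in the current window.

    error_events counts quality failures (sampled hallucinations, parse
    failures, escalations) together with HTTP failures, per the list above.
    """
    allowed = (1.0 - slo_target) * total_events  # budget in event terms
    if allowed == 0:
        return 0.0
    return max(0.0, 1.0 - error_events / allowed)

def should_freeze_changes(error_events: int, total_events: int,
                          slo_target: float, threshold: float = 0.20) -> bool:
    """Freeze non-critical changes when under 20% of budget remains."""
    return budget_remaining(error_events, total_events, slo_target) < threshold
```

With a 99.5% target over 100,000 requests, the budget is 500 error events; at 450 events only 10% of budget remains, so non-critical changes freeze.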
SLO Review Cadence for AI Systems
Review monthly — AI systems have more change vectors than traditional services
Traditional services change only when you deploy. AI systems change when: (1) you deploy code changes; (2) your prompts change; (3) the model provider updates their model; (4) your RAG corpus is updated; (5) usage patterns shift to new query types. A quarterly SLO review misses two to three model provider updates in the review period. Monthly reviews catch regressions before they become patterns. For high-risk AI systems under the EU AI Act, logging and review requirements may impose more frequent cadences.
Checklist: Do You Understand This?
- Why does an AI system need separate latency SLOs for TTFT and total response time?
- How do you measure hallucination rate in a RAG system without manual review of every response?
- What is an error budget, and how should quality failures be counted against it?
- Write three SLOs for a code assistant — include at least one quality SLO and one latency SLO.
- Why should AI SLOs be reviewed monthly rather than quarterly?
- Name four change vectors in an AI system that can degrade SLO performance without a code deployment.