🧠 All Things AI

SLOs for AI Systems

Service Level Objectives (SLOs) for AI systems are harder to define than for traditional APIs. Latency is non-deterministic. Quality is not binary — a response can be partly correct. Hallucination is a quality failure with no error code. And models change under you: a provider update can silently shift behaviour without changing the API. Despite these challenges, SLOs are essential for holding AI systems to measurable standards.

Why AI SLOs Are Different

Traditional SLO assumptions that break

  • Output quality is binary (correct or error code) — AI outputs exist on a spectrum
  • Latency is predictable from input size — LLM latency varies with model load and output length
  • System behaviour is stable between deployments — model provider updates change behaviour without version bumps
  • Error rate captures all failures — quality failures return HTTP 200

AI-specific SLO requirements

  • Separate latency SLOs for TTFT (time to first token) and total response time
  • Quality SLOs measured through sampling, not error codes
  • Cost SLOs — staying within budget is an operational objective
  • Task success rate SLO for agentic systems — completion, not just response
  • More frequent SLO reviews (monthly, not quarterly) due to model change risk

Traditional SLO Dimensions Adapted for AI

  • Availability. Traditional: percentage of requests that return a non-5xx response. AI adaptation: the same, plus the percentage of requests where the model returns a usable response (not a refusal or format error).
  • Latency. Traditional: P50/P95/P99 response time. AI adaptation: track TTFT P50/P95 (time to first streaming token) and total latency P50/P95 (response complete) separately.
  • Error rate. Traditional: percentage of requests returning 4xx/5xx. AI adaptation: HTTP error rate plus format error rate (JSON parse failures, schema violations) plus refusal rate (model declined to answer).
  • Cost. Traditional: N/A (traditional services have a fixed infrastructure cost). AI adaptation: cost per request P95; daily spend vs budget; cost per successful task completion.
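The two latency SLIs above can be computed directly from request telemetry. A minimal sketch, assuming per-request logs that carry `start`, `first_token`, and `done` timestamps (the field names are hypothetical; adapt them to your own schema):

```python
# Sketch: computing separate TTFT and total-latency P95 values from
# per-request logs. Field names (start, first_token, done) are
# assumptions about your telemetry schema, not a standard.
from statistics import quantiles

def p95(values):
    """95th percentile via statistics.quantiles (n=100 gives percentile cuts)."""
    return quantiles(values, n=100)[94]

def latency_slis(requests):
    """requests: iterable of dicts with epoch-second timestamps."""
    ttft = [r["first_token"] - r["start"] for r in requests]
    total = [r["done"] - r["start"] for r in requests]
    return {"ttft_p95": p95(ttft), "total_p95": p95(total)}

# 100 synthetic requests for illustration
reqs = [{"start": 0.0, "first_token": 0.5 + i * 0.01, "done": 4.0 + i * 0.05}
        for i in range(100)]
slis = latency_slis(reqs)
print(slis)  # TTFT P95 ≈ 1.45s, total P95 ≈ 8.75s, checked against the targets
```

Reporting both numbers side by side is what makes the streaming/non-streaming trade-off visible: a model swap can improve TTFT while pushing total latency past its target.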

AI-Specific SLO Dimensions

  • Task success rate. Definition: percentage of agentic tasks completed to a defined outcome. Measure: instrument agent run end states as success / partial / failure / abandoned.
  • Hallucination rate. Definition: percentage of responses containing factual errors (for RAG/factual systems). Measure: automated citation grounding checks (is the answer supported by the retrieved documents?) plus sampled manual review.
  • Guardrail trigger rate. Definition: percentage of requests blocked by policy, a signal of system health. Measure: count blocked or modified requests from guardrail middleware; alert on sudden changes.
  • User satisfaction proxy. Definition: aggregate of thumbs-down rate, escalation rate, and restart rate. Measure: a composite score from feedback signals; a negative feedback rate below 5% is a common starting SLO target.
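The "alert on sudden changes" rule for guardrail trigger rate can be sketched as a comparison of today's rate against a trailing baseline. The 3x jump threshold and the daily granularity are illustrative assumptions, not recommendations:

```python
# Sketch: alerting on a sudden change in guardrail trigger rate.
# The 3x-over-baseline threshold is an assumption for illustration.
def guardrail_alert(daily_blocked, daily_total, factor=3.0):
    """Lists are oldest first; the last entry is today."""
    rates = [b / t for b, t in zip(daily_blocked, daily_total)]
    baseline = sum(rates[:-1]) / len(rates[:-1])  # trailing mean
    today = rates[-1]
    return today > factor * baseline, today, baseline

fired, today, base = guardrail_alert(
    daily_blocked=[12, 9, 11, 10, 60],  # sudden jump on the last day
    daily_total=[1000] * 5)
print(fired)  # alert fires: 0.06 vs a baseline of about 0.0105
```

A jump in either direction is worth investigating: a spike may mean an attack or a prompt regression, while a sudden drop may mean the guardrail middleware has silently stopped running.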
Latency SLOs
  • TTFT P95: first streaming token < 2s
  • Total latency P95: full response < 10s

Quality SLOs
  • Hallucination rate: < 3% (sampled)
  • Task success rate: > 98% (agentic)
  • Format compliance: > 99.5% valid schema

Cost SLOs
  • Cost per request P95: budget per query type
  • Daily spend vs budget: alert on burn acceleration

Availability SLOs
  • HTTP availability: > 99.5% non-5xx
  • Usable response rate: no refusals or parse failures

AI SLOs span four dimensions — traditional services only needed the first and last

SLO Templates by System Type

RAG Chatbot SLOs

  • Availability (usable response returned). Target: 99.5%. Error budget (30-day): 3.6 hours of full outage.
  • TTFT P95 (first streaming token). Target: < 2 seconds. Error budget: 5% of requests may exceed.
  • Total latency P95 (full response). Target: < 10 seconds. Error budget: 5% of requests may exceed.
  • Hallucination rate (citation grounding). Target: < 3%. Error budget: sampled; alert if above 3% in any 24-hour window.
  • Negative feedback rate. Target: < 5%. Error budget: rolling 7-day average.
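The citation-grounding check behind the hallucination-rate SLO can be sketched as follows. This is a deliberately naive lexical-overlap heuristic for illustration only; production systems typically use an NLI model or an LLM judge, and the 0.6 overlap threshold is an assumption:

```python
# Sketch: a naive automated grounding check for sampled RAG responses.
# A sentence counts as grounded if enough of its content words appear
# in some retrieved chunk. The 0.6 threshold is an assumption.
import re

def grounded(sentence, chunks, threshold=0.6):
    words = {w for w in re.findall(r"[a-z]+", sentence.lower()) if len(w) > 3}
    if not words:
        return True  # nothing checkable
    return any(
        len(words & set(re.findall(r"[a-z]+", c.lower()))) / len(words) >= threshold
        for c in chunks)

def hallucination_rate(samples):
    """samples: list of (response_sentence, retrieved_chunks) pairs."""
    flagged = sum(0 if grounded(s, c) else 1 for s, c in samples)
    return flagged / len(samples)

samples = [
    ("The refund window is thirty days.",
     ["Customers may request a refund within thirty days of purchase."]),
    ("Refunds are processed in gold bullion.",
     ["Customers may request a refund within thirty days of purchase."]),
]
print(hallucination_rate(samples))  # 0.5: one of two sampled sentences ungrounded
```

Even a crude check like this is useful as a tripwire: it will not catch subtle hallucinations, but a sudden rise in its flag rate is a cheap early signal worth a manual review pass.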

AI Agent (Document Processing) SLOs

  • Task success rate (document fully processed). Target: 98%. Error budget (30-day): 2% of documents may require manual review.
  • End-to-end processing time P95. Target: < 120 seconds per document. Error budget: 5% of documents may take longer.
  • Structured output format compliance. Target: 99.5%. Error budget: 0.5% schema parse failures acceptable.
  • Cost per document P95. Target: < $0.05. Error budget: alert when P95 exceeds the target for 24 hours.
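The cost SLO could be checked in a daily job alongside a spend-vs-budget alert. The $0.05 P95 target comes from the table above; the daily budget figure and the 1.5x burn-acceleration rule are assumptions for illustration:

```python
# Sketch: daily cost SLO check. The $0.05 P95 target is from the SLO
# table; the daily budget and the 1.5x acceleration rule are assumptions.
from statistics import quantiles

def cost_slo_check(doc_costs, daily_spend, daily_budget=50.0):
    p95_cost = quantiles(doc_costs, n=100)[94]
    trailing = sum(daily_spend[:-1]) / len(daily_spend[:-1])
    return {
        "p95_cost": p95_cost,
        "p95_breach": p95_cost > 0.05,                       # per-doc SLO
        "over_budget": daily_spend[-1] > daily_budget,       # absolute cap
        "burn_accelerating": daily_spend[-1] > 1.5 * trailing,
    }

costs = [0.01 + i * 0.0005 for i in range(200)]  # synthetic per-document costs
report = cost_slo_check(costs, daily_spend=[40, 42, 41, 70])
print(report)
```

Tracking P95 rather than mean cost matters here: a handful of pathological documents (long inputs, retry loops) can dominate spend while leaving the average looking healthy.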

Error Budget for AI Quality Failures

  • Quality event flagged: hallucination, format failure, or escalation
  • Count it as an error event against the relevant SLO budget
  • Track burn rate daily: events consumed vs budget remaining
  • Budget below 20% remaining: freeze non-critical changes
  • Budget exhausted: declare an incident; the SLO has been violated

Quality failures burn error budget just like HTTP errors — measure and act on both

Quality failures consume error budget just as availability failures do. Count flagged quality events (hallucinations, format errors, negative feedback above threshold) as error events for the purpose of error budget calculation.

  • Each sampled hallucination = one error event against the quality SLO error budget
  • Each format parse failure = one error event against the availability/format SLO
  • Each user escalation from AI to human = one error event against task success SLO
  • Track error budget burn rate daily; freeze non-critical changes when budget < 20% remaining
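The bookkeeping above can be sketched in a few lines. The 20% freeze threshold matches the text; the sample volumes and event counts are invented:

```python
# Sketch: translating flagged quality events into error-budget burn.
# The 20% freeze threshold follows the text; sample numbers are invented.
def error_budget(slo_target, total_units, error_events):
    """slo_target e.g. 0.97 for a <3% hallucination-rate SLO."""
    budget = (1 - slo_target) * total_units  # allowed bad events in the window
    remaining = 1 - error_events / budget
    return {
        "budget": budget,
        "remaining_fraction": remaining,
        "freeze_changes": remaining < 0.20,
        "slo_violated": error_events > budget,
    }

# 10,000 sampled responses in the window, <3% hallucination SLO, 250 flagged
status = error_budget(slo_target=0.97, total_units=10_000, error_events=250)
print(status)  # ~17% of budget left: freeze non-critical changes, no incident yet
```

The freeze threshold deliberately fires before exhaustion: with only a sixth of the budget left, one bad prompt change or provider update can tip the window into violation.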

SLO Review Cadence for AI Systems

Review monthly — AI systems have more change vectors than traditional services

Traditional services change only when you deploy. AI systems change when: (1) you deploy code changes; (2) your prompts change; (3) the model provider updates its model; (4) your RAG corpus is updated; (5) usage patterns shift to new query types. A quarterly review window can easily contain two or three model provider updates, each a potential silent regression. Monthly reviews catch regressions before they become patterns. For high-risk AI systems under the EU AI Act, logging and review requirements may impose more frequent cadences.

Checklist: Do You Understand This?

  • Why does an AI system need separate latency SLOs for TTFT and total response time?
  • How do you measure hallucination rate in a RAG system without manual review of every response?
  • What is an error budget, and how should quality failures be counted against it?
  • Write three SLOs for a code assistant — include at least one quality SLO and one latency SLO.
  • Why should AI SLOs be reviewed monthly rather than quarterly?
  • Name four change vectors in an AI system that can degrade SLO performance without a code deployment.