SLOs for AI Systems
Service Level Objectives (SLOs) for AI systems are harder to define than for traditional APIs. Latency is non-deterministic. Quality is not binary — a response can be partly correct. Hallucination is a quality failure with no error code. And models change under you: a provider update can silently shift behaviour without changing the API. Despite these challenges, SLOs are essential for holding AI systems to measurable standards.
Why AI SLOs Are Different
Traditional SLO assumptions that break
- Output quality is binary (correct or error code) — AI outputs exist on a spectrum
- Latency is predictable from input size — LLM latency varies with model load and output length
- System behaviour is stable between deployments — model provider updates change behaviour without version bumps
- Error rate captures all failures — quality failures return HTTP 200
AI-specific SLO requirements
- Separate latency SLOs for TTFT and total response time
- Quality SLOs measured through sampling, not error codes
- Cost SLOs — staying within budget is an operational objective
- Task success rate SLO for agentic systems — completion, not just response
- More frequent SLO reviews (monthly, not quarterly) due to model change risk
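These requirements can be made concrete as a small declarative SLO record. This is a sketch, not a prescribed schema: the field names and the example targets are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slo:
    """One service level objective: a named metric, a target, and a window."""
    name: str                       # metric identifier, e.g. "ttft_p95_seconds"
    target: float                   # ratio or threshold the metric must meet
    window_days: int                # evaluation window for the SLO
    review_cadence_days: int = 30   # monthly review, per the guidance above

# Illustrative set for a chat-style AI system (values are examples, not norms).
EXAMPLE_SLOS = [
    Slo("availability_usable_response", target=0.995, window_days=30),
    Slo("ttft_p95_seconds", target=2.0, window_days=30),
    Slo("total_latency_p95_seconds", target=10.0, window_days=30),
    Slo("hallucination_rate_max", target=0.03, window_days=30),
    Slo("cost_per_request_p95_usd", target=0.05, window_days=30),
]
```

Keeping SLOs as data rather than prose makes the monthly review a diff of one file instead of a document hunt.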
Traditional SLO Dimensions Adapted for AI
| Dimension | Traditional definition | AI adaptation |
|---|---|---|
| Availability | Percentage of requests that return non-5xx response | Same, plus: percentage of requests where the model returns a usable response (not a refusal or format error) |
| Latency | P50/P95/P99 response time | Track separately: TTFT P50/P95 (streaming start) and total latency P50/P95 (response complete) |
| Error rate | Percentage of requests returning 4xx/5xx | HTTP error rate + format error rate (JSON parse failures, schema violations) + refusal rate (model declined to answer) |
| Cost | N/A (traditional services have fixed infrastructure cost) | Cost per request P95; daily spend vs budget; cost per successful task completion |
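Measuring TTFT and total latency separately, as the table recommends, only requires timing the stream in two places. A minimal client-agnostic sketch, assuming the response is any iterable of text chunks (the `fake_stream` generator stands in for a real streaming LLM client):

```python
import time

def measure_streaming_latency(stream):
    """Consume a token stream, recording time-to-first-token (TTFT)
    and total latency as two separate measurements."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        chunks.append(chunk)
    total = time.monotonic() - start  # response complete
    return "".join(chunks), ttft, total

def fake_stream():
    """Stand-in for a real streaming client, for illustration only."""
    for token in ["Hello", ", ", "world"]:
        yield token

text, ttft, total = measure_streaming_latency(fake_stream())
```

Feed `ttft` and `total` into separate histograms so the P95s in the table can be computed independently; a single "response time" histogram cannot recover TTFT after the fact.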
AI-Specific SLO Dimensions
| Dimension | Definition | How to measure |
|---|---|---|
| Task success rate | Percentage of agentic tasks completed to a defined outcome | Instrument agent run end states: success / partial / failure / abandoned |
| Hallucination rate | Percentage of responses containing factual errors (for RAG/factual systems) | Automated: citation grounding check (is the answer supported by retrieved documents?); manual: sampled review |
| Guardrail trigger rate | Percentage of requests blocked by policy — a signal of system health | Count blocked/modified requests from guardrail middleware; alert on sudden changes |
| User satisfaction proxy | Aggregate of thumbs-down rate, escalation rate, restart rate | Composite score from feedback signals; a negative-feedback rate below 5% is a common starting target |
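The automated citation-grounding check for hallucination rate can start very simply. The sketch below uses naive lexical overlap (is every content word of each answer sentence present in the retrieved documents?); a production system would use NLI or an LLM judge, but this illustrates the sampling signal the table describes.

```python
import string

def _words(text: str) -> set[str]:
    """Lowercased content words (> 3 chars) with punctuation stripped."""
    table = str.maketrans("", "", string.punctuation)
    return {w for w in text.translate(table).lower().split() if len(w) > 3}

def grounding_score(answer: str, retrieved_docs: list[str]) -> float:
    """Fraction of answer sentences fully grounded in the retrieved docs.

    Naive lexical check for illustration; real grounding checks should
    use entailment models or an LLM judge.
    """
    vocab = _words(" ".join(retrieved_docs))
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 1.0
    grounded = sum(1 for s in sentences if _words(s) <= vocab)
    return grounded / len(sentences)

score = grounding_score(
    "The Eiffel Tower is in Paris. It is located in Berlin.",
    ["The Eiffel Tower is in Paris and is 330 metres tall."],
)  # one of the two sentences is grounded -> 0.5
```

Run this on a random sample of responses and report the share scoring below a chosen threshold as the estimated hallucination rate.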
AI SLOs span these four quality-oriented dimensions in addition to the adapted traditional ones; none of the four has a direct equivalent for a traditional service
SLO Templates by System Type
RAG Chatbot SLOs
| SLO | Target | Error budget (30-day) |
|---|---|---|
| Availability (usable response returned) | 99.5% | 3.6 hours of full outage |
| TTFT P95 (first streaming token) | < 2 seconds | 5% of requests may exceed |
| Total latency P95 (full response) | < 10 seconds | 5% of requests may exceed |
| Hallucination rate (citation grounding) | < 3% | Sampled; alert if > 3% in any 24h window |
| Negative feedback rate | < 5% | Rolling 7-day average |
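The error-budget column follows directly from the target and the window: a 99.5% availability target over a 720-hour month leaves 0.5%, or 3.6 hours, of full outage. A small helper makes the conversion explicit:

```python
def error_budget(target: float, window_days: int = 30):
    """Translate an availability SLO target into its error budget:
    the allowed failure fraction and the equivalent full-outage hours."""
    budget_fraction = 1.0 - target
    window_hours = window_days * 24
    return budget_fraction, budget_fraction * window_hours

frac, hours = error_budget(0.995)
# 0.5% of a 720-hour month = 3.6 hours of full outage, matching the table
```

The same arithmetic applies to partial degradation: 7.2 hours at 50% failure rate spends the same budget as 3.6 hours fully down.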
AI Agent (Document Processing) SLOs
| SLO | Target | Error budget (30-day) |
|---|---|---|
| Task success rate (document fully processed) | 98% | 2% of documents may require manual review |
| End-to-end processing time P95 | < 120 seconds per document | 5% of documents may take longer |
| Structured output format compliance | 99.5% | 0.5% schema parse failures acceptable |
| Cost per document P95 | < $0.05 | Cost SLO; alert when P95 exceeds target for 24h |
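The cost SLO in this table needs a P95 over per-document costs, not an average, so that a few expensive documents cannot hide behind many cheap ones. A sketch using the nearest-rank percentile (the alert wiring for the 24-hour sustained-breach condition is assumed, not shown):

```python
import math

def p95(values: list[float]) -> float:
    """Nearest-rank P95: the smallest value >= 95% of the sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]

def cost_slo_breached(costs_usd: list[float], target: float = 0.05) -> bool:
    """True when P95 cost per document exceeds the target; in production
    this feeds an alert that fires only after 24h of sustained breach."""
    return p95(costs_usd) > target
```

For example, 5 documents costing $0.20 among 95 costing $0.01 do not breach the P95 target, but 10 of them do; a mean-based check would miss that shift entirely.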
Error Budget for AI Quality Failures
Quality failures burn error budget just like HTTP errors — measure and act on both
Quality failures consume error budget just as availability failures do. Count flagged quality events (hallucinations, format errors, negative feedback above threshold) as error events for the purpose of error budget calculation.
- Each sampled hallucination = one error event against the quality SLO error budget
- Each format parse failure = one error event against the availability/format SLO
- Each user escalation from AI to human = one error event against task success SLO
- Track error budget burn rate daily; freeze non-critical changes when budget < 20% remaining
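The budget-remaining and change-freeze rules above reduce to a few lines once quality events are counted alongside HTTP errors. A sketch, with the 20% freeze threshold from the list as the default:

```python
def budget_remaining(error_events: int, total_events: int,
                     slo_target: float) -> float:
    """Fraction of the error budget still unspent in the current window.

    error_events counts quality failures (sampled hallucinations, parse
    failures, escalations) together with HTTP failures, per the list above.
    """
    allowed = (1.0 - slo_target) * total_events  # budget in event terms
    if allowed == 0:
        return 0.0
    return max(0.0, 1.0 - error_events / allowed)

def should_freeze_changes(error_events: int, total_events: int,
                          slo_target: float, threshold: float = 0.20) -> bool:
    """Freeze non-critical changes when under 20% of budget remains."""
    return budget_remaining(error_events, total_events, slo_target) < threshold
```

With a 99.5% target over 100,000 requests, the budget is 500 error events; at 450 events only 10% of budget remains, so non-critical changes freeze.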
SLO Review Cadence for AI Systems
Review monthly — AI systems have more change vectors than traditional services
Traditional services change only when you deploy. AI systems change when: (1) you deploy code changes; (2) your prompts change; (3) the model provider updates their model; (4) your RAG corpus is updated; (5) usage patterns shift to new query types. A quarterly SLO review misses two to three model provider updates in the review period. Monthly reviews catch regressions before they become patterns. For high-risk AI systems under the EU AI Act, logging and review requirements may impose more frequent cadences.
Checklist: Do You Understand This?
- Why does an AI system need separate latency SLOs for TTFT and total response time?
- How do you measure hallucination rate in a RAG system without manual review of every response?
- What is an error budget, and how should quality failures be counted against it?
- Write three SLOs for a code assistant — include at least one quality SLO and one latency SLO.
- Why should AI SLOs be reviewed monthly rather than quarterly?
- Name four change vectors in an AI system that can degrade SLO performance without a code deployment.