Monitoring & Alerting for AI Systems
Standard APM (application performance monitoring) covers infrastructure and error rates but misses what matters most in AI systems: response quality, cost per request, token patterns, and guardrail behaviour. AI monitoring requires a separate layer that captures LLM-specific signals and connects them to business outcomes.
AI-Specific Metrics Beyond Standard APM
| Metric | What it measures | Alert threshold |
|---|---|---|
| Input tokens per request | Prompt length; growing input tokens = growing cost + latency | Alert when P95 input tokens doubles week-over-week |
| Output tokens per request | Response length; high output tokens = verbose responses or no max_tokens limit | Alert when P95 output tokens reaches > 90% of configured max_tokens (responses are likely hitting the cap and being truncated) |
| Cost per request | Actual spend per LLM call; most important unit for FinOps | Alert when use-case daily cost exceeds 2× rolling 7-day average |
| Cache hit rate | Percentage of requests served from prompt or semantic cache | Alert when cache hit rate drops > 20% from baseline (indicates prompt structure change) |
| Guardrail trigger rate | Percentage of requests blocked or modified by input/output policy | Alert on sudden spike (> 3× baseline) — possible coordinated attack; alert on zero (guardrails may be broken) |
| Tool call count per request | For agentic systems: number of tool invocations per agent run | Alert when P95 tool calls exceeds configured step limit |
| Retry rate | Percentage of requests that required at least one retry | Alert at > 5% — indicates provider instability or rate limit pressure |
| Hallucination flag rate | Percentage of responses flagged by automated quality checks | Alert when rate increases > 2× baseline — possible model regression or prompt drift |
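The cost-per-request threshold above (daily cost exceeding 2× the rolling 7-day average) can be sketched as a small detector. This is an illustrative sketch, not a production anomaly detector; the class name and its interface are assumptions, and in practice the daily totals would come from your observability backend rather than being fed in manually.

```python
from collections import deque


class CostAnomalyDetector:
    """Alert when a use case's daily cost exceeds a multiple of its
    rolling 7-day average, per the threshold in the metrics table."""

    def __init__(self, window_days: int = 7, threshold: float = 2.0):
        self.window = deque(maxlen=window_days)  # recent daily cost totals
        self.threshold = threshold

    def record_day(self, daily_cost: float) -> bool:
        """Record today's total cost; return True if it should fire an alert.

        No alert fires until a full baseline window has accumulated."""
        alert = False
        if len(self.window) == self.window.maxlen:
            baseline = sum(self.window) / len(self.window)
            alert = daily_cost > self.threshold * baseline
        self.window.append(daily_cost)
        return alert
```

For example, seven days at roughly $10/day followed by a $25 day would fire; a $15 day would not.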
Latency Breakdown
Total end-to-end latency for an AI request has multiple distinct components. Alert on each separately — a spike in one component has different root causes than a spike in another.
Latency components to track
- TTFT (Time to First Token) — server receives request to first streaming token; reflects model load and prompt processing time
- Generation time — first token to last token; proportional to output length
- Tool execution time — for agentic systems; time spent in tool calls outside the LLM
- Total end-to-end — from user request to response complete; includes all above + network + your processing
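The split between TTFT and generation time can be captured with one wall-clock measurement around the streaming loop. A minimal sketch, assuming `stream` is any iterator of response tokens; the provider-specific client call that produces it is left out:

```python
import time


def measure_streaming_latency(stream):
    """Split a streaming LLM response into TTFT and generation time.

    TTFT = request sent to first token; generation = first token to
    last token. Any token iterator works; the provider client is assumed."""
    start = time.monotonic()
    ttft = None
    last = start
    tokens = 0
    for _token in stream:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start  # time to first token
        last = now
        tokens += 1
    generation = (last - start) - (ttft or 0.0)  # first token -> last token
    return {"ttft_s": ttft, "generation_s": generation, "output_tokens": tokens}
```

Emitting these as separate histogram metrics is what makes the per-component alerting below possible.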
Recommended alert thresholds
- TTFT P95 > 3s for interactive use case → investigate model load or retry overhead
- Total latency P95 > 2× P50 → tail latency problem; check for outlier inputs
- Tool execution P95 > 5s → slow external API or database query; fix the tool, not the LLM
- Latency spike correlating with 429 rate spike → rate limit is the cause; add backoff or route to secondary
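The "P95 > 2× P50" tail-latency check can be computed directly from a window of recorded latencies. A sketch using the standard library; the function name and return shape are assumptions:

```python
import statistics


def tail_latency_alert(latencies_ms: list[float], ratio: float = 2.0):
    """Flag a tail-latency problem when P95 exceeds `ratio` x P50.

    Returns (should_alert, p50, p95). Needs at least two samples."""
    qs = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    p50, p95 = qs[49], qs[94]                       # 50th and 95th percentiles
    return p95 > ratio * p50, p50, p95
```

A uniform latency distribution passes; a window where a few requests take 10× the median fires, pointing at outlier inputs rather than a global slowdown.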
Quality Signal Monitoring
Quality failures in AI systems do not produce error codes — the request succeeds (HTTP 200) but the response is wrong or unhelpful. Quality must be measured separately through indirect signals and sampling.
| Quality signal | What it indicates | How to collect |
|---|---|---|
| User thumbs down / negative feedback | User found the response unhelpful, wrong, or inappropriate | Inline feedback widget; track rate per use case and per model version |
| Escalation rate | User escalated to human agent after AI response; AI failed to resolve | Track escalation events in your support workflow; correlate with AI conversation IDs |
| Task success rate | Percentage of agentic tasks completed successfully vs abandoned or errored | Instrument agent run outcomes; distinguish tool errors from quality failures |
| Format error rate | Model returned wrong format (e.g., JSON parse failed, required fields missing) | Validate structured outputs at application layer; log and count validation failures |
| Conversation restart rate | User started a new conversation immediately after previous one; implicit signal of failure | Detect short-session conversations followed by restart within 2 minutes |
Observability Stack Recommendations
Langfuse (recommended for most teams)
- Open source; self-hosted or cloud; full trace capture
- Token cost tracking per model / use case / user
- Prompt versioning and experiment comparison
- SDK for Python / TypeScript; LiteLLM native integration
- 2025 status: most widely adopted open-source LLM observability platform
LangSmith (LangChain ecosystem)
- SaaS; deep integration with LangChain and LangGraph
- Trace visualisation for complex agentic workflows
- Human annotation tools for quality labelling
- Best choice if you are building on LangChain/LangGraph
- Less suitable for non-LangChain architectures
Complement either with your existing infrastructure monitoring stack (Datadog, Grafana, Prometheus) for infrastructure metrics (host CPU, memory, network). AI observability tools cover LLM-specific signals; standard APM covers everything else. Do not try to make one tool do both.
SLO-Aligned Alerting
Alert on SLO burn rate, not just threshold crossings
A single high-latency request is noise; a sustained burn through your error budget is signal. SLO-aligned alerting fires when the rate of bad events is high enough to exhaust your error budget within a defined window. For example, alert when the burn rate measured over the last hour would, if sustained, exhaust the 30-day error budget within 24 hours. This prevents alert fatigue from transient spikes while catching sustained degradations before they exhaust SLO headroom.
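The burn-rate calculation itself is small. A sketch under the usual SRE convention: burn rate is the observed error rate divided by the error-budget rate, so a burn rate of 1.0 exhausts the budget exactly at the end of the SLO window. The function names and default windows (30-day SLO, fire if exhaustion would occur within 24 hours) are assumptions:

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error rate divided by the error budget rate.

    e.g. slo=0.999 gives an error budget of 0.001; a burn rate of 1.0
    spends the budget exactly over the full SLO window."""
    error_budget = 1.0 - slo
    return (bad_events / total_events) / error_budget


def should_alert(bad_events: int, total_events: int, slo: float,
                 slo_window_hours: float = 30 * 24,
                 exhaust_within_hours: float = 24) -> bool:
    """Fire when the current burn rate, if sustained, would exhaust
    the error budget within `exhaust_within_hours`."""
    threshold = slo_window_hours / exhaust_within_hours  # e.g. 720 / 24 = 30
    return burn_rate(bad_events, total_events, slo) >= threshold
```

With a 99.9% SLO, a window of 10,000 requests with 400 bad events has a burn rate of 40 and fires (it would exhaust a 30-day budget in under 24 hours); 100 bad events gives a burn rate of 10 and does not.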
Checklist: Do You Understand This?
- Why does standard APM miss the most important signals in an AI system? Name three AI-specific metrics it cannot capture.
- What is TTFT and why should it be alerted on separately from total end-to-end latency?
- Why does a guardrail trigger rate of zero warrant an alert — not just a high rate?
- Name three quality signals that can be collected without asking users to rate responses.
- What is the difference between Langfuse and LangSmith — and when would you choose each?
- What is SLO burn rate alerting, and why is it preferable to simple threshold-crossing alerts for AI systems?