Monitoring & Alerting for AI Systems
Standard APM (application performance monitoring) covers infrastructure and error rates but misses what matters most in AI systems: response quality, cost per request, token patterns, and guardrail behaviour. AI monitoring requires a separate layer that captures LLM-specific signals and connects them to business outcomes.
AI-Specific Metrics Beyond Standard APM
| Metric | What it measures | Alert threshold |
|---|---|---|
| Input tokens per request | Prompt length; growing input tokens = growing cost + latency | Alert when P95 input tokens doubles week-over-week |
| Output tokens per request | Response length; high output tokens = verbose responses or no max_tokens limit | Alert when P95 output tokens reaches > 90% of configured max_tokens (responses are likely hitting the cap and being truncated) |
| Cost per request | Actual spend per LLM call; most important unit for FinOps | Alert when use-case daily cost exceeds 2× rolling 7-day average |
| Cache hit rate | Percentage of requests served from prompt or semantic cache | Alert when cache hit rate drops > 20% from baseline (indicates prompt structure change) |
| Guardrail trigger rate | Percentage of requests blocked or modified by input/output policy | Alert on sudden spike (> 3× baseline) — possible coordinated attack; alert on zero (guardrails may be broken) |
| Tool call count per request | For agentic systems: number of tool invocations per agent run | Alert when P95 tool calls exceeds configured step limit |
| Retry rate | Percentage of requests that required at least one retry | Alert at > 5% — indicates provider instability or rate limit pressure |
| Hallucination flag rate | Percentage of responses flagged by automated quality checks | Alert when rate increases > 2× baseline — possible model regression or prompt drift |
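The cost-per-request threshold above (daily cost exceeding 2× the rolling 7-day average) can be sketched as a small detector. This is an illustrative sketch, not a production anomaly detector; the class name and its interface are assumptions, and in practice the daily totals would come from your observability backend rather than being fed in manually.

```python
from collections import deque


class CostAnomalyDetector:
    """Alert when a use case's daily cost exceeds a multiple of its
    rolling 7-day average, per the threshold in the metrics table."""

    def __init__(self, window_days: int = 7, threshold: float = 2.0):
        self.window = deque(maxlen=window_days)  # recent daily cost totals
        self.threshold = threshold

    def record_day(self, daily_cost: float) -> bool:
        """Record today's total cost; return True if it should fire an alert.

        No alert fires until a full baseline window has accumulated."""
        alert = False
        if len(self.window) == self.window.maxlen:
            baseline = sum(self.window) / len(self.window)
            alert = daily_cost > self.threshold * baseline
        self.window.append(daily_cost)
        return alert
```

For example, seven days at roughly $10/day followed by a $25 day would fire; a $15 day would not.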
Latency Breakdown
Total end-to-end latency for an AI request has multiple distinct components. Alert on each separately — a spike in one component has different root causes than a spike in another.
Latency components to track
- TTFT (Time to First Token) — server receives request to first streaming token; reflects model load and prompt processing time
- Generation time — first token to last token; proportional to output length
- Tool execution time — for agentic systems; time spent in tool calls outside the LLM
- Total end-to-end — from user request to response complete; includes all above + network + your processing
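The split between TTFT and generation time can be captured with one wall-clock measurement around the streaming loop. A minimal sketch, assuming `stream` is any iterator of response tokens; the provider-specific client call that produces it is left out:

```python
import time


def measure_streaming_latency(stream):
    """Split a streaming LLM response into TTFT and generation time.

    TTFT = request sent to first token; generation = first token to
    last token. Any token iterator works; the provider client is assumed."""
    start = time.monotonic()
    ttft = None
    last = start
    tokens = 0
    for _token in stream:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start  # time to first token
        last = now
        tokens += 1
    generation = (last - start) - (ttft or 0.0)  # first token -> last token
    return {"ttft_s": ttft, "generation_s": generation, "output_tokens": tokens}
```

Emitting these as separate histogram metrics is what makes the per-component alerting below possible.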
Recommended alert thresholds
- TTFT P95 > 3s for interactive use case → investigate model load or retry overhead
- Total latency P95 > 2× P50 → tail latency problem; check for outlier inputs
- Tool execution P95 > 5s → slow external API or database query; fix the tool, not the LLM
- Latency spike correlating with 429 rate spike → rate limit is the cause; add backoff or route to secondary
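The "P95 > 2× P50" tail-latency check can be computed directly from a window of recorded latencies. A sketch using the standard library; the function name and return shape are assumptions:

```python
import statistics


def tail_latency_alert(latencies_ms: list[float], ratio: float = 2.0):
    """Flag a tail-latency problem when P95 exceeds `ratio` x P50.

    Returns (should_alert, p50, p95). Needs at least two samples."""
    qs = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    p50, p95 = qs[49], qs[94]                       # 50th and 95th percentiles
    return p95 > ratio * p50, p50, p95
```

A uniform latency distribution passes; a window where a few requests take 10× the median fires, pointing at outlier inputs rather than a global slowdown.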
Quality Signal Monitoring
Quality failures in AI systems do not produce error codes — the request succeeds (HTTP 200) but the response is wrong or unhelpful. Quality must be measured separately through indirect signals and sampling.
| Quality signal | What it indicates | How to collect |
|---|---|---|
| User thumbs down / negative feedback | User found the response unhelpful, wrong, or inappropriate | Inline feedback widget; track rate per use case and per model version |
| Escalation rate | User escalated to human agent after AI response; AI failed to resolve | Track escalation events in your support workflow; correlate with AI conversation IDs |
| Task success rate | Percentage of agentic tasks completed successfully vs abandoned or errored | Instrument agent run outcomes; distinguish tool errors from quality failures |
| Format error rate | Model returned wrong format (e.g., JSON parse failed, required fields missing) | Validate structured outputs at application layer; log and count validation failures |
| Conversation restart rate | User started a new conversation immediately after previous one; implicit signal of failure | Detect short-session conversations followed by restart within 2 minutes |
Observability Stack Recommendations
Langfuse (recommended for most teams)
- Open source; self-hosted or cloud; full trace capture
- Token cost tracking per model / use case / user
- Prompt versioning and experiment comparison
- SDK for Python / TypeScript; LiteLLM native integration
- 2025 status: most widely adopted open-source LLM observability platform
LangSmith (LangChain ecosystem)
- SaaS; deep integration with LangChain and LangGraph
- Trace visualisation for complex agentic workflows
- Human annotation tools for quality labelling
- Best choice if you are building on LangChain/LangGraph
- Less suitable for non-LangChain architectures
Complement either with your existing infrastructure monitoring stack (Datadog, Grafana, Prometheus) for infrastructure metrics (host CPU, memory, network). AI observability tools cover LLM-specific signals; standard APM covers everything else. Do not try to make one tool do both.
SLO-Aligned Alerting
Alert on SLO burn rate, not just threshold crossings
A single high-latency request is noise; a sustained burn through your error budget is signal. SLO-aligned alerting fires when the rate of bad events is high enough to exhaust your error budget within a defined window. For example, alert when the burn rate measured over the last hour would, if sustained, exhaust the 30-day error budget within 24 hours. This prevents alert fatigue from transient spikes while catching sustained degradations before they exhaust SLO headroom.
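The burn-rate calculation itself is small. A sketch under the usual SRE convention: burn rate is the observed error rate divided by the error-budget rate, so a burn rate of 1.0 exhausts the budget exactly at the end of the SLO window. The function names and default windows (30-day SLO, fire if exhaustion would occur within 24 hours) are assumptions:

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error rate divided by the error budget rate.

    e.g. slo=0.999 gives an error budget of 0.001; a burn rate of 1.0
    spends the budget exactly over the full SLO window."""
    error_budget = 1.0 - slo
    return (bad_events / total_events) / error_budget


def should_alert(bad_events: int, total_events: int, slo: float,
                 slo_window_hours: float = 30 * 24,
                 exhaust_within_hours: float = 24) -> bool:
    """Fire when the current burn rate, if sustained, would exhaust
    the error budget within `exhaust_within_hours`."""
    threshold = slo_window_hours / exhaust_within_hours  # e.g. 720 / 24 = 30
    return burn_rate(bad_events, total_events, slo) >= threshold
```

With a 99.9% SLO, a window of 10,000 requests with 400 bad events has a burn rate of 40 and fires (it would exhaust a 30-day budget in under 24 hours); 100 bad events gives a burn rate of 10 and does not.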
Checklist: Do You Understand This?
- Why does standard APM miss the most important signals in an AI system? Name three AI-specific metrics it cannot capture.
- What is TTFT and why should it be alerted on separately from total end-to-end latency?
- Why does a guardrail trigger rate of zero warrant an alert — not just a high rate?
- Name three quality signals that can be collected without asking users to rate responses.
- What is the difference between Langfuse and LangSmith — and when would you choose each?
- What is SLO burn rate alerting, and why is it preferable to simple threshold-crossing alerts for AI systems?