
Monitoring & Alerting for AI Systems

Standard APM (application performance monitoring) covers infrastructure and error rates but misses what matters most in AI systems: response quality, cost per request, token patterns, and guardrail behaviour. AI monitoring requires a separate layer that captures LLM-specific signals and connects them to business outcomes.

AI-Specific Metrics Beyond Standard APM

| Metric | What it measures | Alert threshold |
| --- | --- | --- |
| Input tokens per request | Prompt length; growing input tokens = growing cost + latency | Alert when P95 input tokens doubles week-over-week |
| Output tokens per request | Response length; high output tokens = verbose responses or no max_tokens limit | Alert when P95 output tokens exceeds configured max_tokens by > 10% |
| Cost per request | Actual spend per LLM call; most important unit for FinOps | Alert when use-case daily cost exceeds 2× rolling 7-day average |
| Cache hit rate | Percentage of requests served from prompt or semantic cache | Alert when cache hit rate drops > 20% from baseline (indicates prompt structure change) |
| Guardrail trigger rate | Percentage of requests blocked or modified by input/output policy | Alert on sudden spike (> 3× baseline) — possible coordinated attack; alert on zero (guardrails may be broken) |
| Tool call count per request | For agentic systems: number of tool invocations per agent run | Alert when P95 tool calls exceeds configured step limit |
| Retry rate | Percentage of requests that required at least one retry | Alert at > 5% — indicates provider instability or rate limit pressure |
| Hallucination flag rate | Percentage of responses flagged by automated quality checks | Alert when rate increases > 2× baseline — possible model regression or prompt drift |
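Cost per request falls out of the token counts already on every LLM response. A minimal sketch of the metric and the 2× rolling-average alert above — the model names and per-token prices are illustrative placeholders, not current rates:

```python
# Illustrative per-1M-token prices in USD; real prices vary by provider
# and change over time — load them from config, not constants.
PRICING = {
    "example-model-small": {"input": 0.15, "output": 0.60},
    "example-model-large": {"input": 3.00, "output": 15.00},
}

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single LLM call, derived from its token usage."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def daily_cost_alert(today_cost: float, rolling_7day_avg: float) -> bool:
    """Fire when a use case's daily spend exceeds 2x its rolling 7-day average."""
    return today_cost > 2 * rolling_7day_avg
```

Emitting this per use case (not just per account) is what makes the 2× rule actionable: a single use case doubling its spend is visible even when total spend looks flat.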

Latency Breakdown

Total end-to-end latency for an AI request has multiple distinct components. Alert on each separately — a spike in one component has different root causes than a spike in another.

Latency components to track

  • TTFT (Time to First Token) — server receives request to first streaming token; reflects model load and prompt processing time
  • Generation time — first token to last token; proportional to output length
  • Tool execution time — for agentic systems; time spent in tool calls outside the LLM
  • Total end-to-end — from user request to response complete; includes all above + network + your processing
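With a streaming API, TTFT and generation time fall out of three timestamps taken in your client code. A sketch of the bookkeeping — the class and method names are hypothetical, not from any particular SDK:

```python
import time

class LatencyTimer:
    """Records the latency components of one streamed LLM response."""

    def __init__(self):
        self.request_sent = time.monotonic()  # request handed to the provider
        self.first_token_at = None
        self.last_token_at = None

    def on_token(self):
        """Call once per streamed token (or chunk)."""
        now = time.monotonic()
        if self.first_token_at is None:
            self.first_token_at = now
        self.last_token_at = now

    @property
    def ttft(self) -> float:
        """Time to first token, in seconds."""
        return self.first_token_at - self.request_sent

    @property
    def generation_time(self) -> float:
        """First token to last token, in seconds; scales with output length."""
        return self.last_token_at - self.first_token_at
```

Emit `ttft` and `generation_time` as separate histogram metrics; aggregating them into one number hides exactly the distinction the thresholds below rely on.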

Recommended alert thresholds

  • TTFT P95 > 3s for interactive use case → investigate model load or retry overhead
  • Total latency P95 > 2× P50 → tail latency problem; check for outlier inputs
  • Tool execution P95 > 5s → slow external API or database query; fix the tool, not the LLM
  • Latency spike correlating with 429 rate spike → rate limit is the cause; add backoff or route to secondary
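The P95-versus-P50 tail check is cheap to compute from raw samples. A sketch using a nearest-rank percentile (no external dependencies assumed):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; p is in [0, 100]."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, math.ceil(p / 100 * len(ordered)) - 1))
    return ordered[k]

def tail_latency_alert(latencies_s: list[float]) -> bool:
    """Flag a tail-latency problem: P95 more than 2x P50, per the rule above."""
    return percentile(latencies_s, 95) > 2 * percentile(latencies_s, 50)
```

In practice you would pull these percentiles from your metrics backend rather than raw samples, but the comparison is the same.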

Quality Signal Monitoring

Quality failures in AI systems do not produce error codes — the request succeeds (HTTP 200) but the response is wrong or unhelpful. Quality must be measured separately through indirect signals and sampling.

| Quality signal | What it indicates | How to collect |
| --- | --- | --- |
| User thumbs down / negative feedback | User found the response unhelpful, wrong, or inappropriate | Inline feedback widget; track rate per use case and per model version |
| Escalation rate | User escalated to human agent after AI response; AI failed to resolve | Track escalation events in your support workflow; correlate with AI conversation IDs |
| Task success rate | Percentage of agentic tasks completed successfully vs abandoned or errored | Instrument agent run outcomes; distinguish tool errors from quality failures |
| Format error rate | Model returned wrong format (e.g., JSON parse failed, required fields missing) | Validate structured outputs at application layer; log and count validation failures |
| Conversation restart rate | User started a new conversation immediately after the previous one; implicit signal of failure | Detect short-session conversations followed by restart within 2 minutes |

Observability Stack Recommendations

Langfuse (recommended for most teams)

  • Open source; self-hosted or cloud; full trace capture
  • Token cost tracking per model / use case / user
  • Prompt versioning and experiment comparison
  • SDK for Python / TypeScript; LiteLLM native integration
  • 2025 status: most widely adopted open-source LLM observability platform

LangSmith (LangChain ecosystem)

  • SaaS; deep integration with LangChain and LangGraph
  • Trace visualisation for complex agentic workflows
  • Human annotation tools for quality labelling
  • Best choice if you are building on LangChain/LangGraph
  • Less suitable for non-LangChain architectures

Complement either with your existing infrastructure monitoring stack (Datadog, Grafana, Prometheus) for infrastructure metrics (host CPU, memory, network). AI observability tools cover LLM-specific signals; standard APM covers everything else. Do not try to make one tool do both.

SLO-Aligned Alerting

Alert on SLO burn rate, not just threshold crossings

A single high-latency request is noise. A sustained burn through your error budget is signal. SLO-aligned alerting fires when the rate of bad events is high enough to exhaust your error budget well before the end of the SLO window (for example, when the last hour's burn rate, if sustained, would exhaust a 30-day error budget within 24 hours). This prevents alert fatigue from transient spikes while catching sustained degradations before they exhaust SLO headroom.

Checklist: Do You Understand This?

  • Why does standard APM miss the most important signals in an AI system? Name three AI-specific metrics it cannot capture.
  • What is TTFT and why should it be alerted on separately from total end-to-end latency?
  • Why does a guardrail trigger rate of zero warrant an alert — not just a high rate?
  • Name three quality signals that can be collected without asking users to rate responses.
  • What is the difference between Langfuse and LangSmith — and when would you choose each?
  • What is SLO burn rate alerting, and why is it preferable to simple threshold-crossing alerts for AI systems?