🧠 All Things AI
Intermediate

Error Handling & Observability

AI workflows fail in ways that traditional software does not: LLM calls time out, tool results are malformed, agents reason in circles, and costs spike unexpectedly. Without structured error handling and observability, diagnosing these failures is guesswork. This page covers the error categories specific to LLM systems, retry patterns, and the observability stack needed to understand what your AI is doing in production.

Error Categories in AI Workflows

| Category | Examples | Retriable? | Handling approach |
| --- | --- | --- | --- |
| Transient API errors | 429 rate limit, 503 overload, network timeout | Yes | Exponential backoff with jitter |
| Context / input errors | Token limit exceeded, invalid message format | No (without modification) | Truncate / summarise context, then retry |
| Output format errors | JSON parse failure, schema mismatch, missing required fields | Yes (with prompt adjustment) | Retry with explicit format reminder; max 2 retries |
| Tool execution errors | API key invalid, resource not found, permission denied | No | Return structured error to agent; agent decides next step |
| Logic / reasoning errors | Agent stuck in loop, wrong tool selected, hallucinated args | Situational | Step limit + circuit breaker; human escalation |
| Content policy errors | Provider refusal, guardrail block, finish_reason=content_filter | No | Log reason, return graceful message, do not retry same prompt |
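The routing logic this table implies can be sketched as a small classifier. The category names, status-code mapping, and function signature below are illustrative assumptions, not any specific SDK's API:

```python
from enum import Enum
from typing import Optional

class ErrorCategory(Enum):
    TRANSIENT = "transient"              # retry with backoff
    CONTEXT = "context"                  # modify input, then retry
    OUTPUT_FORMAT = "output_format"      # retry with format reminder
    TOOL = "tool"                        # return structured error to agent
    LOGIC = "logic"                      # step limit / circuit breaker
    CONTENT_POLICY = "content_policy"    # never retry

def classify_error(status_code: Optional[int] = None,
                   finish_reason: Optional[str] = None,
                   parse_failed: bool = False) -> ErrorCategory:
    """Map raw failure signals to one of the categories in the table above."""
    if status_code in (429, 500, 502, 503, 504):
        return ErrorCategory.TRANSIENT
    if status_code == 400:
        return ErrorCategory.CONTEXT     # e.g. token limit exceeded
    if status_code in (401, 403, 404):
        return ErrorCategory.TOOL        # bad key, permission, not found
    if finish_reason == "content_filter":
        return ErrorCategory.CONTENT_POLICY
    if parse_failed:
        return ErrorCategory.OUTPUT_FORMAT
    return ErrorCategory.LOGIC           # fall-through: needs inspection
```

Centralising this mapping keeps retry policy decisions in one place instead of scattered across call sites.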

Retry Patterns

Not all errors should be retried, and not all retries should be identical. Retrying the same request immediately on a rate limit just amplifies the problem. The key principle: modify before retry — never retry the exact same call after a non-transient error.

Exponential backoff with jitter (for 429 / 503)

Wait time = min(cap, base × 2^attempt) + random(0, jitter)

  • Base: 1s; Cap: 60s; Jitter: ±20% — prevents thundering herd when many workers retry simultaneously
  • Libraries: tenacity (Python), retry (JS) — both support exponential backoff natively
  • Maximum 3–5 retries for most API calls; log each retry attempt
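The formula above can be implemented in a few lines without a library. This is a minimal sketch using the ±20% jitter guideline from the bullets; in production, tenacity's built-in wait strategies cover the same ground:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter_frac: float = 0.2) -> float:
    """min(cap, base * 2**attempt), then +/- up to 20% random jitter."""
    delay = min(cap, base * 2 ** attempt)
    return delay + random.uniform(-jitter_frac, jitter_frac) * delay

def call_with_backoff(fn, max_retries: int = 5):
    """Retry `fn` on failure with exponential backoff between attempts."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # in practice, catch only transient errors (429/503)
            if attempt == max_retries - 1:
                raise      # retries exhausted; surface the error
            time.sleep(backoff_delay(attempt))
```

Note the catch-all `except` is for brevity; a real implementation would first run the error through a classifier so that non-transient errors are not retried.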

Retry with modification (for output format errors)

  • On JSON parse failure: retry with the original response appended and instruction "The above was not valid JSON. Please respond with only valid JSON matching this schema: [schema]"
  • Limit to 2 correction retries — if it fails twice, the model may not be capable of the format; fall back or escalate
  • Track format error rate by model — high rates indicate a schema or model capability mismatch
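The correction loop above can be sketched as follows. The `call_llm` callable and the message format are generic assumptions standing in for your provider's SDK:

```python
import json

def get_json_response(call_llm, messages, max_corrections: int = 2):
    """Request JSON output; on parse failure, append the bad response plus a
    format reminder and retry, at most `max_corrections` times."""
    for attempt in range(max_corrections + 1):
        reply = call_llm(messages)
        try:
            return json.loads(reply)
        except json.JSONDecodeError:
            if attempt == max_corrections:
                raise  # failed twice: fall back or escalate
            # Retry with modification: show the model its own invalid output
            messages = messages + [
                {"role": "assistant", "content": reply},
                {"role": "user", "content":
                 "The above was not valid JSON. Respond with only valid JSON."},
            ]
```

Each failed parse should also increment a per-model format-error counter, feeding the error-rate metric described above.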

Never-retry errors

  • 401 Unauthorized — retrying burns quota while the key remains invalid
  • Content policy refusal — the model has determined the request is policy-violating; the identical request will be refused again
  • Argument validation failures on tool calls — the args are wrong; the same call will fail again

Tool Error Handling

When a tool call fails, the result must still be returned to the agent — never leave a tool call result empty. An empty result causes the agent to assume the tool succeeded silently, leading to downstream reasoning errors.

Tool error response pattern:

  • Always return a structured result: { "success": false, "error": "...", "code": "RESOURCE_NOT_FOUND" }
  • Include actionable context: "User ID 12345 was not found. Valid IDs are integers between 1 and 99999."
  • Do not expose internal stack traces in tool results — they add noise to the context window
  • The agent decides the next step: it may retry with a corrected argument, try an alternative tool, or escalate to the user
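A thin wrapper around every tool invocation enforces this pattern. The error codes and the `KeyError`-to-`RESOURCE_NOT_FOUND` mapping here are illustrative; map your own tool exceptions as appropriate:

```python
def run_tool(tool_fn, **args):
    """Invoke a tool so the agent always receives a structured result,
    never an empty one and never a raw stack trace."""
    try:
        return {"success": True, "result": tool_fn(**args)}
    except KeyError as exc:
        return {"success": False,
                "code": "RESOURCE_NOT_FOUND",
                # actionable context, no internal stack trace
                "error": f"Resource {exc} was not found."}
    except Exception as exc:
        return {"success": False,
                "code": "TOOL_ERROR",
                "error": f"{type(exc).__name__}: {exc}"}
```

The structured result goes back into the conversation as the tool message, leaving the retry-or-escalate decision to the agent.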

The Observability Stack

Traditional APM (Application Performance Monitoring) tools track HTTP latency and error rates. LLM observability requires a richer semantic model: traces that capture what was in the prompt, what the model output, which tools were called, and how much it all cost.

What a complete LLM trace captures:

  • Trace ID and session / thread ID
  • Every LLM call: model, system prompt hash, user message, response, finish reason
  • Token counts: prompt tokens, completion tokens, total — per call and per session
  • Latency: time-to-first-token, total generation time, tool execution time
  • Every tool call: name, arguments (sanitised), result, success/failure
  • Retrieval steps: query, chunks retrieved, scores
  • Guardrail checks: which ran, outcome, latency
  • Cost estimate: per call and per session (USD)
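A subset of this trace model can be captured with plain dataclasses when a full platform is not yet in place. The field names below are an illustrative schema, not any platform's format:

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class LLMCallSpan:
    model: str
    prompt_tokens: int
    completion_tokens: int
    finish_reason: str
    latency_ms: float
    cost_usd: float = 0.0

@dataclass
class Trace:
    session_id: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    llm_calls: list = field(default_factory=list)

    @property
    def total_tokens(self) -> int:
        """Prompt + completion tokens across the whole session."""
        return sum(c.prompt_tokens + c.completion_tokens for c in self.llm_calls)

    @property
    def total_cost_usd(self) -> float:
        """Session-level cost rollup, per the list above."""
        return sum(c.cost_usd for c in self.llm_calls)
```

Real deployments would extend this with tool-call, retrieval, and guardrail spans, nested under the same trace ID.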

Observability Platforms

LangSmith (by LangChain)

  • Deep integration with LangChain and LangGraph — near-zero instrumentation effort for LangChain users
  • Automatic tracing of all LLM calls, tool calls, and retrieval steps within a chain/graph
  • Evaluation datasets, prompt versioning, annotation queues for human labelling
  • Tracing is designed to add minimal latency overhead — suitable for performance-critical production use
  • Cloud-hosted (SaaS); self-hosted option in enterprise plan

Langfuse (open-source)

  • Self-hostable (Docker, Kubernetes) — data stays on your infrastructure
  • Framework-agnostic: SDKs for Python, JS, + OpenTelemetry integration
  • Full trace tree: LLM calls, tool calls, retrieval, custom spans — all nested correctly
  • Prompt management: version, test, and deploy prompts from the UI
  • Evaluation: run evals on trace datasets, score traces with LLM-as-judge
  • Growing fast: 20k+ GitHub stars, YC W23 company as of 2025

OpenTelemetry (OTEL) — the emerging standard

  • Industry converging on OTEL as the standard for collecting agent telemetry (2025)
  • Instrument once with OTEL; export to any backend (Langfuse, Jaeger, Datadog, Honeycomb)
  • Semantic conventions for LLM spans being standardised (gen_ai.* attributes)
  • Best for teams that already have an observability stack and want LLM traces alongside infrastructure traces

Key Metrics to Track

Operational metrics

  • TTFT: time-to-first-token (user-perceived responsiveness)
  • Total latency: end-to-end response time including tools
  • Token usage: per call, per session, per user — spot anomalies
  • Cost per task: USD cost to complete one end-user action
  • Error rate by category: 429s, format errors, tool failures
  • Step count distribution: how many tool calls does a typical agent task take?
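Cost per task is derived from per-call token counts and per-model pricing. A minimal sketch — the model name and per-million-token prices below are placeholders, not real provider rates:

```python
# Per-million-token prices in USD. PLACEHOLDER values: always look up
# your provider's current pricing.
PRICE_PER_MTOK = {
    "example-model": {"prompt": 3.00, "completion": 15.00},
}

def estimate_cost_usd(model: str, prompt_tokens: int,
                      completion_tokens: int) -> float:
    """Estimate the USD cost of one LLM call from its token counts."""
    p = PRICE_PER_MTOK[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000
```

Summing this per call and grouping by session or user makes cost anomalies and runaway tasks visible.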

Quality metrics

  • Task success rate: % of tasks completed without human escalation
  • Hallucination rate: % of responses flagged as ungrounded (from output guardrails)
  • Format error rate: % of LLM calls that fail structured output parsing
  • Repeat tool call rate: % of tasks where the same tool is called 3+ times (stuck loop proxy)
  • User satisfaction: thumbs up/down, edit rate, repeat question rate
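The repeat-tool-call proxy from the list above is straightforward to compute from a trace's sequence of tool names:

```python
from collections import Counter

def has_repeated_tool_calls(tool_names, threshold: int = 3) -> bool:
    """Stuck-loop proxy: True if any single tool was called
    `threshold` or more times within one task."""
    counts = Counter(tool_names)
    return any(n >= threshold for n in counts.values())
```

Aggregating this flag across tasks gives the repeat-tool-call rate; a rising rate often precedes a visible drop in task success.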

Alerting Thresholds

Set alerts before you need them. The following thresholds are starting points — tune them to your specific system after observing baseline behaviour for 2–4 weeks.

  • P99 latency > 2× baseline → alert
  • 429 error rate > 5% over 5-minute window → alert (approaching rate limit)
  • Cost per hour > 2× the hourly average (daily spend ÷ 24) → alert (runaway spending)
  • Task success rate drops > 10% from baseline → alert (model or prompt regression)
  • Any agent run exceeding 70% of configured step limit → warn (approaching stuck loop)
  • Any agent run exceeding 70% of context window → warn (approaching overflow)
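These starting-point thresholds translate into a simple evaluation function. The metric and baseline field names are illustrative assumptions about your metrics pipeline:

```python
def check_alerts(metrics: dict, baseline: dict) -> list:
    """Evaluate the starting-point thresholds above; returns triggered alerts."""
    alerts = []
    if metrics["p99_latency_s"] > 2 * baseline["p99_latency_s"]:
        alerts.append("latency")
    if metrics["rate_429"] > 0.05:                       # 5% over 5-min window
        alerts.append("rate_limit")
    if metrics["cost_per_hour_usd"] > 2 * (baseline["daily_cost_usd"] / 24):
        alerts.append("cost")
    if metrics["task_success_rate"] < baseline["task_success_rate"] - 0.10:
        alerts.append("success_regression")
    if metrics["steps_used"] > 0.7 * metrics["step_limit"]:
        alerts.append("step_limit_warn")
    return alerts
```

Running this against 2–4 weeks of observed baselines, then tightening the constants, gives alerts tuned to your actual traffic.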

Logging Best Practices

Log these

  • Every LLM call with model, token counts, finish reason, latency
  • Every tool call with name, sanitised args, result type, success/failure
  • Every guardrail check and its outcome
  • Session start/end with total cost, total steps, final status
  • All errors with category, code, and whether retry was attempted

Never log these

  • Full user messages in production without explicit consent + retention policy
  • API keys, credentials, or secrets that appear in tool arguments
  • PII (names, emails, SSNs) — sanitise before logging
  • Full prompt content at DEBUG level in production (token-expensive to store)
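Sanitisation before logging can be done with a small set of redaction patterns. The patterns below are illustrative examples, not a complete PII or secret-detection solution:

```python
import re

# Illustrative patterns only; extend for your own secret and PII formats.
REDACTIONS = [
    (re.compile(r"sk-[A-Za-z0-9]{8,}"), "[REDACTED_KEY]"),         # API-key-like
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),  # emails
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),      # US SSNs
]

def sanitise(text: str) -> str:
    """Strip secret- and PII-like substrings before text reaches the logs."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Applying `sanitise` in a logging filter ensures tool arguments and user content are scrubbed at a single choke point rather than at every call site.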

Checklist: Do You Understand This?

  • What are the six error categories specific to AI workflows, and which ones are retriable without modification?
  • How does exponential backoff with jitter work, and why is jitter necessary?
  • Why should a tool call always return a result even when the tool fails?
  • What is the key difference between LangSmith and Langfuse, and when would you choose each?
  • What does a complete LLM trace capture that traditional APM tools do not?
  • What five alerts would you set up first for a new production AI workflow?