🧠 All Things AI
Intermediate

Error Handling & Observability

AI workflows fail in ways that traditional software does not: LLM calls time out, tool results are malformed, agents reason in circles, and costs spike unexpectedly. Without structured error handling and observability, diagnosing these failures is guesswork. This page covers the error categories specific to LLM systems, retry patterns, and the observability stack needed to understand what your AI is doing in production.

Error Categories in AI Workflows

| Category | Examples | Retriable? | Handling approach |
| --- | --- | --- | --- |
| Transient API errors | 429 rate limit, 503 overload, network timeout | Yes | Exponential backoff with jitter |
| Context / input errors | Token limit exceeded, invalid message format | No (without modification) | Truncate / summarise context, then retry |
| Output format errors | JSON parse failure, schema mismatch, missing required fields | Yes (with prompt adjustment) | Retry with explicit format reminder; max 2 retries |
| Tool execution errors | API key invalid, resource not found, permission denied | No | Return structured error to agent; agent decides next step |
| Logic / reasoning errors | Agent stuck in loop, wrong tool selected, hallucinated args | Situational | Step limit + circuit breaker; human escalation |
| Content policy errors | Provider refusal, guardrail block, finish_reason=content_filter | No | Log reason, return graceful message, do not retry same prompt |
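The routing logic this table implies can be sketched as a small classifier. The category names, status-code mapping, and function signature below are illustrative assumptions, not any specific SDK's API:

```python
from enum import Enum
from typing import Optional

class ErrorCategory(Enum):
    TRANSIENT = "transient"              # retry with backoff
    CONTEXT = "context"                  # modify input, then retry
    OUTPUT_FORMAT = "output_format"      # retry with format reminder
    TOOL = "tool"                        # return structured error to agent
    LOGIC = "logic"                      # step limit / circuit breaker
    CONTENT_POLICY = "content_policy"    # never retry

def classify_error(status_code: Optional[int] = None,
                   finish_reason: Optional[str] = None,
                   parse_failed: bool = False) -> ErrorCategory:
    """Map raw failure signals to one of the categories in the table above."""
    if status_code in (429, 500, 502, 503, 504):
        return ErrorCategory.TRANSIENT
    if status_code == 400:
        return ErrorCategory.CONTEXT     # e.g. token limit exceeded
    if status_code in (401, 403, 404):
        return ErrorCategory.TOOL        # bad key, permission, not found
    if finish_reason == "content_filter":
        return ErrorCategory.CONTENT_POLICY
    if parse_failed:
        return ErrorCategory.OUTPUT_FORMAT
    return ErrorCategory.LOGIC           # fall-through: needs inspection
```

Centralising this mapping keeps retry policy decisions in one place instead of scattered across call sites.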

Retry Patterns

Not all errors should be retried, and not all retries should be identical. Retrying the same request immediately on a rate limit just amplifies the problem. The key principle: modify before retry — never retry the exact same call after a non-transient error.

Exponential backoff with jitter (for 429 / 503)

Wait time = min(cap, base × 2^attempt) + random(0, jitter)

  • Base: 1s; Cap: 60s; Jitter: ±20% — prevents thundering herd when many workers retry simultaneously
  • Libraries: tenacity (Python), retry (JS) — both support exponential backoff natively
  • Maximum 3–5 retries for most API calls; log each retry attempt
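The formula above can be implemented in a few lines without a library. This is a minimal sketch using the ±20% jitter guideline from the bullets; in production, tenacity's built-in wait strategies cover the same ground:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter_frac: float = 0.2) -> float:
    """min(cap, base * 2**attempt), then +/- up to 20% random jitter."""
    delay = min(cap, base * 2 ** attempt)
    return delay + random.uniform(-jitter_frac, jitter_frac) * delay

def call_with_backoff(fn, max_retries: int = 5):
    """Retry `fn` on failure with exponential backoff between attempts."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # in practice, catch only transient errors (429/503)
            if attempt == max_retries - 1:
                raise      # retries exhausted; surface the error
            time.sleep(backoff_delay(attempt))
```

Note the catch-all `except` is for brevity; a real implementation would first run the error through a classifier so that non-transient errors are not retried.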

Retry with modification (for output format errors)

  • On JSON parse failure: retry with the original response appended and instruction "The above was not valid JSON. Please respond with only valid JSON matching this schema: [schema]"
  • Limit to 2 correction retries — if it fails twice, the model may not be capable of the format; fall back or escalate
  • Track format error rate by model — high rates indicate a schema or model capability mismatch
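The correction loop above can be sketched as follows. The `call_llm` callable and the message format are generic assumptions standing in for your provider's SDK:

```python
import json

def get_json_response(call_llm, messages, max_corrections: int = 2):
    """Request JSON output; on parse failure, append the bad response plus a
    format reminder and retry, at most `max_corrections` times."""
    for attempt in range(max_corrections + 1):
        reply = call_llm(messages)
        try:
            return json.loads(reply)
        except json.JSONDecodeError:
            if attempt == max_corrections:
                raise  # failed twice: fall back or escalate
            # Retry with modification: show the model its own invalid output
            messages = messages + [
                {"role": "assistant", "content": reply},
                {"role": "user", "content":
                 "The above was not valid JSON. Respond with only valid JSON."},
            ]
```

Each failed parse should also increment a per-model format-error counter, feeding the error-rate metric described above.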

Never-retry errors

  • 401 Unauthorized — retrying burns quota while the key remains invalid
  • Content policy refusal — the model has determined the request is policy-violating; the identical request will be refused again
  • Argument validation failures on tool calls — the args are wrong; the same call will fail again

Tool Error Handling

When a tool call fails, the result must still be returned to the agent — never leave a tool call result empty. An empty result causes the agent to assume the tool succeeded silently, leading to downstream reasoning errors.

Tool error response pattern:

  • Always return a structured result: { "success": false, "error": "...", "code": "RESOURCE_NOT_FOUND" }
  • Include actionable context: "User ID 12345 was not found. Valid IDs are integers between 1 and 99999."
  • Do not expose internal stack traces in tool results — they add noise to the context window
  • The agent decides the next step: it may retry with a corrected argument, try an alternative tool, or escalate to the user
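A thin wrapper around every tool invocation enforces this pattern. The error codes and the `KeyError`-to-`RESOURCE_NOT_FOUND` mapping here are illustrative; map your own tool exceptions as appropriate:

```python
def run_tool(tool_fn, **args):
    """Invoke a tool so the agent always receives a structured result,
    never an empty one and never a raw stack trace."""
    try:
        return {"success": True, "result": tool_fn(**args)}
    except KeyError as exc:
        return {"success": False,
                "code": "RESOURCE_NOT_FOUND",
                # actionable context, no internal stack trace
                "error": f"Resource {exc} was not found."}
    except Exception as exc:
        return {"success": False,
                "code": "TOOL_ERROR",
                "error": f"{type(exc).__name__}: {exc}"}
```

The structured result goes back into the conversation as the tool message, leaving the retry-or-escalate decision to the agent.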

The Observability Stack

Traditional APM (Application Performance Monitoring) tools track HTTP latency and error rates. LLM observability requires a richer semantic model: traces that capture what was in the prompt, what the model output, which tools were called, and how much it all cost.

What a complete LLM trace captures:

  • Trace ID and session / thread ID
  • Every LLM call: model, system prompt hash, user message, response, finish reason
  • Token counts: prompt tokens, completion tokens, total — per call and per session
  • Latency: time-to-first-token, total generation time, tool execution time
  • Every tool call: name, arguments (sanitised), result, success/failure
  • Retrieval steps: query, chunks retrieved, scores
  • Guardrail checks: which ran, outcome, latency
  • Cost estimate: per call and per session (USD)
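A subset of this trace model can be captured with plain dataclasses when a full platform is not yet in place. The field names below are an illustrative schema, not any platform's format:

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class LLMCallSpan:
    model: str
    prompt_tokens: int
    completion_tokens: int
    finish_reason: str
    latency_ms: float
    cost_usd: float = 0.0

@dataclass
class Trace:
    session_id: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    llm_calls: list = field(default_factory=list)

    @property
    def total_tokens(self) -> int:
        """Prompt + completion tokens across the whole session."""
        return sum(c.prompt_tokens + c.completion_tokens for c in self.llm_calls)

    @property
    def total_cost_usd(self) -> float:
        """Session-level cost rollup, per the list above."""
        return sum(c.cost_usd for c in self.llm_calls)
```

Real deployments would extend this with tool-call, retrieval, and guardrail spans, nested under the same trace ID.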

Observability Platforms

LangSmith (by LangChain)

  • Deep integration with LangChain and LangGraph — near-zero instrumentation effort for LangChain users
  • Automatic tracing of all LLM calls, tool calls, and retrieval steps within a chain/graph
  • Evaluation datasets, prompt versioning, annotation queues for human labelling
  • Tracing is designed to add minimal latency overhead — suitable for performance-critical production use
  • Cloud-hosted (SaaS); self-hosted option in enterprise plan

Langfuse (open-source)

  • Self-hostable (Docker, Kubernetes) — data stays on your infrastructure
  • Framework-agnostic: SDKs for Python, JS, + OpenTelemetry integration
  • Full trace tree: LLM calls, tool calls, retrieval, custom spans — all nested correctly
  • Prompt management: version, test, and deploy prompts from the UI
  • Evaluation: run evals on trace datasets, score traces with LLM-as-judge
  • Growing fast: 20k+ GitHub stars, YC W23 company as of 2025

OpenTelemetry (OTEL) — the emerging standard

  • Industry converging on OTEL as the standard for collecting agent telemetry (2025)
  • Instrument once with OTEL; export to any backend (Langfuse, Jaeger, Datadog, Honeycomb)
  • Semantic conventions for LLM spans being standardised (gen_ai.* attributes)
  • Best for teams that already have an observability stack and want LLM traces alongside infrastructure traces

Key Metrics to Track

Operational metrics

  • TTFT: time-to-first-token (user-perceived responsiveness)
  • Total latency: end-to-end response time including tools
  • Token usage: per call, per session, per user — spot anomalies
  • Cost per task: USD cost to complete one end-user action
  • Error rate by category: 429s, format errors, tool failures
  • Step count distribution: how many tool calls does a typical agent task take?
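Cost per task is derived from per-call token counts and per-model pricing. A minimal sketch — the model name and per-million-token prices below are placeholders, not real provider rates:

```python
# Per-million-token prices in USD. PLACEHOLDER values: always look up
# your provider's current pricing.
PRICE_PER_MTOK = {
    "example-model": {"prompt": 3.00, "completion": 15.00},
}

def estimate_cost_usd(model: str, prompt_tokens: int,
                      completion_tokens: int) -> float:
    """Estimate the USD cost of one LLM call from its token counts."""
    p = PRICE_PER_MTOK[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000
```

Summing this per call and grouping by session or user makes cost anomalies and runaway tasks visible.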

Quality metrics

  • Task success rate: % of tasks completed without human escalation
  • Hallucination rate: % of responses flagged as ungrounded (from output guardrails)
  • Format error rate: % of LLM calls that fail structured output parsing
  • Repeat tool call rate: % of tasks where the same tool is called 3+ times (stuck loop proxy)
  • User satisfaction: thumbs up/down, edit rate, repeat question rate
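The repeat-tool-call proxy from the list above is straightforward to compute from a trace's sequence of tool names:

```python
from collections import Counter

def has_repeated_tool_calls(tool_names, threshold: int = 3) -> bool:
    """Stuck-loop proxy: True if any single tool was called
    `threshold` or more times within one task."""
    counts = Counter(tool_names)
    return any(n >= threshold for n in counts.values())
```

Aggregating this flag across tasks gives the repeat-tool-call rate; a rising rate often precedes a visible drop in task success.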

Alerting Thresholds

Set alerts before you need them. The following thresholds are starting points — tune them to your specific system after observing baseline behaviour for 2–4 weeks.

  • P99 latency > 2× baseline → alert
  • 429 error rate > 5% over 5-minute window → alert (approaching rate limit)
  • Cost per hour > 2× the hourly average (daily spend ÷ 24) → alert (runaway spending)
  • Task success rate drops > 10% from baseline → alert (model or prompt regression)
  • Any agent run exceeding 70% of configured step limit → warn (approaching stuck loop)
  • Any agent run exceeding 70% of context window → warn (approaching overflow)
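These starting-point thresholds translate into a simple evaluation function. The metric and baseline field names are illustrative assumptions about your metrics pipeline:

```python
def check_alerts(metrics: dict, baseline: dict) -> list:
    """Evaluate the starting-point thresholds above; returns triggered alerts."""
    alerts = []
    if metrics["p99_latency_s"] > 2 * baseline["p99_latency_s"]:
        alerts.append("latency")
    if metrics["rate_429"] > 0.05:                       # 5% over 5-min window
        alerts.append("rate_limit")
    if metrics["cost_per_hour_usd"] > 2 * (baseline["daily_cost_usd"] / 24):
        alerts.append("cost")
    if metrics["task_success_rate"] < baseline["task_success_rate"] - 0.10:
        alerts.append("success_regression")
    if metrics["steps_used"] > 0.7 * metrics["step_limit"]:
        alerts.append("step_limit_warn")
    return alerts
```

Running this against 2–4 weeks of observed baselines, then tightening the constants, gives alerts tuned to your actual traffic.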

Logging Best Practices

Log these

  • Every LLM call with model, token counts, finish reason, latency
  • Every tool call with name, sanitised args, result type, success/failure
  • Every guardrail check and its outcome
  • Session start/end with total cost, total steps, final status
  • All errors with category, code, and whether retry was attempted

Never log these

  • Full user messages in production without explicit consent + retention policy
  • API keys, credentials, or secrets that appear in tool arguments
  • PII (names, emails, SSNs) — sanitise before logging
  • Full prompt content at DEBUG level in production (token-expensive to store)
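Sanitisation before logging can be done with a small set of redaction patterns. The patterns below are illustrative examples, not a complete PII or secret-detection solution:

```python
import re

# Illustrative patterns only; extend for your own secret and PII formats.
REDACTIONS = [
    (re.compile(r"sk-[A-Za-z0-9]{8,}"), "[REDACTED_KEY]"),         # API-key-like
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),  # emails
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),      # US SSNs
]

def sanitise(text: str) -> str:
    """Strip secret- and PII-like substrings before text reaches the logs."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Applying `sanitise` in a logging filter ensures tool arguments and user content are scrubbed at a single choke point rather than at every call site.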

Checklist: Do You Understand This?

  • What are the six error categories specific to AI workflows, and which ones are retriable without modification?
  • How does exponential backoff with jitter work, and why is jitter necessary?
  • Why should a tool call always return a result even when the tool fails?
  • What is the key difference between LangSmith and Langfuse, and when would you choose each?
  • What does a complete LLM trace capture that traditional APM tools do not?
  • What five alerts would you set up first for a new production AI workflow?