# Observability with AI
AI is genuinely useful for observability work — not to replace alerting platforms, but to accelerate the two most time-consuming tasks: making sense of unfamiliar logs and writing queries for metrics platforms you don't have memorised. The critical limitation is that AI has no knowledge of your system's normal behaviour or history, so you must always provide that context explicitly.
## Log Analysis Prompts
Paste logs directly into AI chat. The single most effective improvement is adding a line telling AI what you already know — otherwise it explains things you have already ruled out.
### 1. Root Cause Analysis from Logs
```
I am debugging an incident in a [DESCRIBE SYSTEM — e.g., "Python Django API serving mobile clients"].

What I already know: [WHAT YOU HAVE RULED OUT — e.g., "database is healthy, latency spike started at 14:32 UTC"]

Here are the relevant logs from that time window:
[PASTE LOGS]

Based on these logs:
1. What is the most likely root cause?
2. What evidence in the logs supports this?
3. What would you check next to confirm?
4. What could cause this pattern that the logs alone cannot tell me?
```
Item 4 is key — it forces AI to be explicit about the limits of what logs alone can show, preventing false confidence in the diagnosis.
### 2. Error Pattern Identification
```
Here are 200 lines of logs from the past hour. Group all errors by type and frequency. For each group, give:
- The error message pattern (use [VARIABLE] for dynamic parts)
- Count of occurrences
- Whether this appears to be the same root cause or multiple distinct issues
- Severity: is this user-visible, data-corrupting, or infrastructure-only?

[PASTE LOGS]
```
Useful when you have a large log dump and need to prioritise what to investigate first.
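The grouping this prompt asks for can also be approximated locally before pasting anything, which helps when the dump is too large for a chat window. A minimal Python sketch — the normalisation regexes and sample log lines are illustrative assumptions, not a standard, so extend them for your own log format:

```python
import re
from collections import Counter

# Illustrative rules: replace common dynamic parts (timestamps, hex IDs,
# bare numbers) with [VARIABLE] so identical errors collapse into one pattern.
NORMALISERS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "[VARIABLE]"),  # timestamps
    (re.compile(r"0x[0-9a-fA-F]+"), "[VARIABLE]"),                             # hex addresses
    (re.compile(r"\b\d+\b"), "[VARIABLE]"),                                    # bare numbers
]

def normalise(line: str) -> str:
    for pattern, repl in NORMALISERS:
        line = pattern.sub(repl, line)
    return line.strip()

def group_errors(lines):
    """Return (pattern, count) pairs for ERROR lines, most frequent first."""
    counts = Counter(normalise(line) for line in lines if "ERROR" in line)
    return counts.most_common()

# Hypothetical sample input for demonstration
lines = [
    "2025-01-10T14:32:01 ERROR timeout calling payments after 5000 ms",
    "2025-01-10T14:32:07 ERROR timeout calling payments after 5012 ms",
    "2025-01-10T14:33:00 ERROR connection refused to 10.0.0.7",
]
for pattern, count in group_errors(lines):
    print(count, pattern)
```

Running this collapses the two timeout lines into a single pattern with a count of 2, giving you the frequency-ranked list to paste into the prompt instead of raw logs.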
## Observability Query Generation
Always provide your platform name and any schema or metric names you know. AI will hallucinate metric names if you do not anchor it to your actual data.
```
Write a [PLATFORM — Datadog / Prometheus / CloudWatch / Grafana] query to [DESCRIBE WHAT YOU WANT TO MEASURE].

Context:
- My service emits these metrics/log fields: [LIST KNOWN METRIC NAMES]
- I want to detect: [e.g., "p99 latency exceeding 2 seconds, grouped by endpoint"]
- Time window: [e.g., "rolling 5-minute window"]

After the query, explain what each function does so I can adapt it.
```
The explanation request is essential — without it you cannot modify the query when your metric names differ.
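For the example context above ("p99 latency exceeding 2 seconds, grouped by endpoint"), a reasonable answer might be PromQL like the following — a sketch that assumes a histogram metric named `http_request_duration_seconds_bucket` with an `endpoint` label, which is exactly the kind of name you must replace with your own:

```promql
# p99 latency per endpoint over a rolling 5-minute window,
# keeping only series currently above 2 seconds
histogram_quantile(
  0.99,
  sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
) > 2
```

Here `rate(...[5m])` converts the cumulative bucket counters into per-second rates over the window, `sum by (le, endpoint)` aggregates while preserving the bucket-boundary label `le` that `histogram_quantile` requires, and the trailing `> 2` filters to breaching endpoints. This is the per-function explanation the prompt asks for, and why it matters: without it you cannot safely swap in your own metric names.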
## Alert Triage
AI helps write alert runbooks and triage guides that on-call engineers can follow without deep system knowledge.
```
Write an on-call runbook for the following alert:

Alert name: [ALERT NAME]
What it fires on: [e.g., "error rate > 5% for 5 minutes"]
System: [DESCRIBE YOUR SERVICE AND ITS DEPENDENCIES]
Common causes we have seen: [LIST KNOWN CAUSES]

The runbook should include:
1. First 60 seconds: what to check immediately to assess severity
2. Diagnostic steps in order, with the command or query for each
3. Likely causes mapped to their fixes
4. Escalation criteria: when to wake someone else up
5. Rollback procedure if a recent deployment is the cause
Update the "common causes we have seen" section after each real incident — this is what makes the runbook genuinely useful over time.
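To make the prompt concrete, the example trigger "error rate > 5% for 5 minutes" could be expressed as a Prometheus alerting rule like this — a sketch assuming a counter named `http_requests_total` with `status` and `service` labels (substitute your own metric and label names):

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # share of 5xx responses over the last 5 minutes, per service
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% on {{ $labels.service }}"
```

Pasting the actual rule expression into the runbook prompt, rather than a paraphrase, gives AI the exact condition the on-call engineer will be staring at when the page fires.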
## AI-Powered Observability Tools (2025–2026)
| Tool | AI Capability | Best For | Limitation |
|---|---|---|---|
| Datadog AI | Natural-language query (Bits AI), anomaly detection, alert correlation | Teams already on Datadog; broad platform coverage | Query suggestions sometimes miss custom metric names |
| Dynatrace Davis AI | Automated root cause analysis, dependency mapping, problem grouping | Enterprise with complex service graphs; automatic baselining | Expensive; requires Dynatrace OneAgent deployed everywhere |
| Grafana AI Assist | Natural-language to PromQL/LogQL, dashboard generation | Teams running self-hosted Prometheus/Loki stack | Requires knowing your metric/label names to correct hallucinations |
| New Relic Grok | Conversational NRQL query builder, anomaly summarisation | Teams new to NRQL who want natural-language entry point | Generated NRQL still needs validation before use in dashboards |
| Honeycomb AI | Natural-language to BubbleUp trace analysis | High-cardinality distributed trace debugging | Strong only if you have structured trace data already |
| Claude / ChatGPT | Log analysis, query generation, runbook writing, post-mortem drafting | Ad-hoc investigation; platforms AI does not natively integrate with | No access to live metrics; all context must be pasted manually |
## Post-Mortem Generation
```
Write a blameless post-mortem for the following incident. Use the standard format: Summary, Timeline, Root Cause, Contributing Factors, Impact, Action Items.

Incident summary: [DESCRIBE WHAT HAPPENED IN 2-3 SENTENCES]

Timeline (paste your notes, Slack messages, or alert history):
[PASTE RAW TIMELINE NOTES]

Rules:
- Blameless: describe system failures, not individual failures
- Action items must be specific and assignable, not vague (e.g., "Add circuit breaker to payment-service" not "improve resilience")
- For each action item: what will it prevent and by when should it be done
```
AI excels at turning rough notes into a clean narrative. You must review and correct the timeline — AI will smooth over gaps that matter.
## What AI Cannot Do
### AI handles well
- Pattern-matching known error types in logs
- Generating platform queries from natural language
- Writing runbook structure and diagnostic steps
- Drafting post-mortem narrative from raw notes
- Explaining unfamiliar stack traces or error codes
### AI cannot do this without you
- Know your system's normal baseline — you must provide it
- Know your deployment history — it cannot see recent changes
- Access live metrics or dashboards — paste or describe them
- Distinguish business-critical vs non-critical errors — you decide
- Confirm a root cause — it can only narrow hypotheses
## Checklist: Do You Understand This?
- Why is the "what I already know" section important in a log analysis prompt?
- What must you provide when asking AI to generate a Prometheus or Datadog query — and why?
- Write an alert runbook prompt for a "database connection pool exhausted" alert in a Node.js service.
- What is the key difference between Dynatrace Davis and using Claude/ChatGPT for observability work?
- Why does AI post-mortem generation require you to verify the timeline carefully?
- Name three things AI genuinely cannot do in an active incident without your input.