# Observability with AI
AI is genuinely useful for observability work — not to replace alerting platforms, but to accelerate the two most time-consuming tasks: making sense of unfamiliar logs and writing queries for metrics platforms you don't have memorised. The critical limitation is that AI has no knowledge of your system's normal behaviour or history, so you must always provide that context explicitly.
## Log Analysis Prompts
Paste logs directly into AI chat. The single most effective improvement is adding a line telling AI what you already know — otherwise it explains things you have already ruled out.
### 1. Root Cause Analysis from Logs
```
I am debugging an incident in a [DESCRIBE SYSTEM — e.g., "Python Django API serving mobile clients"].

What I already know: [WHAT YOU HAVE RULED OUT — e.g., "database is healthy, latency spike started at 14:32 UTC"]

Here are the relevant logs from that time window:
[PASTE LOGS]

Based on these logs:
1. What is the most likely root cause?
2. What evidence in the logs supports this?
3. What would you check next to confirm?
4. What could cause this pattern that the logs alone cannot tell me?
```
Item 4 is key — it forces AI to be explicit about the limits of what logs alone can show, preventing false confidence in the diagnosis.
### 2. Error Pattern Identification
```
Here are 200 lines of logs from the past hour. Group all errors by type and frequency. For each group, give:
- The error message pattern (use [VARIABLE] for dynamic parts)
- Count of occurrences
- Whether this appears to be the same root cause or multiple distinct issues
- Severity: is this user-visible, data-corrupting, or infrastructure-only?

[PASTE LOGS]
```
Useful when you have a large log dump and need to prioritise what to investigate first.
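The grouping this prompt asks for can also be approximated locally before pasting anything, which helps when the dump is too large for a chat window. A minimal Python sketch — the normalisation regexes and sample log lines are illustrative assumptions, not a standard, so extend them for your own log format:

```python
import re
from collections import Counter

# Illustrative rules: replace common dynamic parts (timestamps, hex IDs,
# bare numbers) with [VARIABLE] so identical errors collapse into one pattern.
NORMALISERS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "[VARIABLE]"),  # timestamps
    (re.compile(r"0x[0-9a-fA-F]+"), "[VARIABLE]"),                             # hex addresses
    (re.compile(r"\b\d+\b"), "[VARIABLE]"),                                    # bare numbers
]

def normalise(line: str) -> str:
    for pattern, repl in NORMALISERS:
        line = pattern.sub(repl, line)
    return line.strip()

def group_errors(lines):
    """Return (pattern, count) pairs for ERROR lines, most frequent first."""
    counts = Counter(normalise(line) for line in lines if "ERROR" in line)
    return counts.most_common()

# Hypothetical sample input for demonstration
lines = [
    "2025-01-10T14:32:01 ERROR timeout calling payments after 5000 ms",
    "2025-01-10T14:32:07 ERROR timeout calling payments after 5012 ms",
    "2025-01-10T14:33:00 ERROR connection refused to 10.0.0.7",
]
for pattern, count in group_errors(lines):
    print(count, pattern)
```

Running this collapses the two timeout lines into a single pattern with a count of 2, giving you the frequency-ranked list to paste into the prompt instead of raw logs.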
## Observability Query Generation
Always provide your platform name and any schema or metric names you know. AI will hallucinate metric names if you do not anchor it to your actual data.
```
Write a [PLATFORM — Datadog / Prometheus / CloudWatch / Grafana] query to [DESCRIBE WHAT YOU WANT TO MEASURE].

Context:
- My service emits these metrics/log fields: [LIST KNOWN METRIC NAMES]
- I want to detect: [e.g., "p99 latency exceeding 2 seconds, grouped by endpoint"]
- Time window: [e.g., "rolling 5-minute window"]

After the query, explain what each function does so I can adapt it.
```
The explanation request is essential — without it you cannot modify the query when your metric names differ.
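For the example context above ("p99 latency exceeding 2 seconds, grouped by endpoint"), a reasonable answer might be PromQL like the following — a sketch that assumes a histogram metric named `http_request_duration_seconds_bucket` with an `endpoint` label, which is exactly the kind of name you must replace with your own:

```promql
# p99 latency per endpoint over a rolling 5-minute window,
# keeping only series currently above 2 seconds
histogram_quantile(
  0.99,
  sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
) > 2
```

Here `rate(...[5m])` converts the cumulative bucket counters into per-second rates over the window, `sum by (le, endpoint)` aggregates while preserving the bucket-boundary label `le` that `histogram_quantile` requires, and the trailing `> 2` filters to breaching endpoints. This is the per-function explanation the prompt asks for, and why it matters: without it you cannot safely swap in your own metric names.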
## Alert Triage
AI helps write alert runbooks and triage guides that on-call engineers can follow without deep system knowledge.
```
Write an on-call runbook for the following alert:

Alert name: [ALERT NAME]
What it fires on: [e.g., "error rate > 5% for 5 minutes"]
System: [DESCRIBE YOUR SERVICE AND ITS DEPENDENCIES]
Common causes we have seen: [LIST KNOWN CAUSES]

The runbook should include:
1. First 60 seconds: what to check immediately to assess severity
2. Diagnostic steps in order, with the command or query for each
3. Likely causes mapped to their fixes
4. Escalation criteria: when to wake someone else up
5. Rollback procedure if a recent deployment is the cause
Update the "common causes we have seen" section after each real incident — this is what makes the runbook genuinely useful over time.
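To make the prompt concrete, the example trigger "error rate > 5% for 5 minutes" could be expressed as a Prometheus alerting rule like this — a sketch assuming a counter named `http_requests_total` with `status` and `service` labels (substitute your own metric and label names):

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # share of 5xx responses over the last 5 minutes, per service
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% on {{ $labels.service }}"
```

Pasting the actual rule expression into the runbook prompt, rather than a paraphrase, gives AI the exact condition the on-call engineer will be staring at when the page fires.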
## AI-Powered Observability Tools (2025–2026)
| Tool | AI Capability | Best For | Limitation |
|---|---|---|---|
| Datadog AI | Natural-language query (Bits AI), anomaly detection, alert correlation | Teams already on Datadog; broad platform coverage | Query suggestions sometimes miss custom metric names |
| Dynatrace Davis AI | Automated root cause analysis, dependency mapping, problem grouping | Enterprise with complex service graphs; automatic baselining | Expensive; requires Dynatrace OneAgent deployed everywhere |
| Grafana AI Assist | Natural-language to PromQL/LogQL, dashboard generation | Teams running self-hosted Prometheus/Loki stack | Requires knowing your metric/label names to correct hallucinations |
| New Relic Grok | Conversational NRQL query builder, anomaly summarisation | Teams new to NRQL who want natural-language entry point | Generated NRQL still needs validation before use in dashboards |
| Honeycomb AI | Natural-language to BubbleUp trace analysis | High-cardinality distributed trace debugging | Strong only if you have structured trace data already |
| Claude / ChatGPT | Log analysis, query generation, runbook writing, post-mortem drafting | Ad-hoc investigation; platforms AI does not natively integrate with | No access to live metrics; all context must be pasted manually |
## Post-Mortem Generation
```
Write a blameless post-mortem for the following incident. Use the standard format: Summary, Timeline, Root Cause, Contributing Factors, Impact, Action Items.

Incident summary: [DESCRIBE WHAT HAPPENED IN 2-3 SENTENCES]

Timeline (paste your notes, Slack messages, or alert history):
[PASTE RAW TIMELINE NOTES]

Rules:
- Blameless: describe system failures, not individual failures
- Action items must be specific and assignable, not vague (e.g., "Add circuit breaker to payment-service" not "improve resilience")
- For each action item: what will it prevent and by when should it be done
```
AI excels at turning rough notes into a clean narrative. You must review and correct the timeline — AI will smooth over gaps that matter.
## What AI Cannot Do
### AI handles well
- Pattern-matching known error types in logs
- Generating platform queries from natural language
- Writing runbook structure and diagnostic steps
- Drafting post-mortem narrative from raw notes
- Explaining unfamiliar stack traces or error codes
### AI cannot do this without you
- Know your system's normal baseline — you must provide it
- Know your deployment history — it cannot see recent changes
- Access live metrics or dashboards — paste or describe them
- Distinguish business-critical vs non-critical errors — you decide
- Confirm a root cause — it can only narrow hypotheses
## Checklist: Do You Understand This?
- Why is the "what I already know" section important in a log analysis prompt?
- What must you provide when asking AI to generate a Prometheus or Datadog query — and why?
- Write an alert runbook prompt for a "database connection pool exhausted" alert in a Node.js service.
- What is the key difference between Dynatrace Davis and using Claude/ChatGPT for observability work?
- Why does AI post-mortem generation require you to verify the timeline carefully?
- Name three things AI genuinely cannot do in an active incident without your input.