Incident Response with AI
AI is most useful in incident response before and after incidents — not during them. In an active incident, the cost of switching context to query an AI chatbot usually exceeds the benefit. Where AI genuinely helps is in preparation work: writing runbooks before incidents happen, and turning raw incident notes into structured post-mortems after.
When AI Helps vs When It Slows You Down
Before and after incidents (high value)
- Writing runbooks for known alert scenarios
- Generating post-mortem narratives from raw notes
- Drafting incident communication templates
- Creating on-call handoff checklists
- Analysing logs after the incident to confirm root cause
- Writing action items that are specific and verifiable
During active incidents (use sparingly)
- Context-switching to AI chat adds cognitive load under pressure
- AI has no visibility into your live systems
- AI cannot read your current dashboards or metrics
- Pasting large log volumes takes time you may not have
- Pre-written runbooks are faster than prompting in the moment
Exception: AI is useful during incidents for a specific task — looking up unfamiliar error codes, library behaviours, or cloud provider error responses that you do not have memorised. For targeted lookups, it is faster than documentation search.
Writing Runbooks with AI
The best time to write a runbook is during a quiet period, not during an incident. AI can turn a brief description of an alert scenario into a structured runbook in minutes.
Write a runbook for the following alert scenario. The audience is an on-call engineer who may not be familiar with this specific service.
Alert: [ALERT NAME AND CONDITION — e.g., "HighMemoryUsage: memory > 85% for 10 minutes"]
Service: [SERVICE NAME AND WHAT IT DOES]
Dependencies: [KEY DEPENDENCIES — e.g., "reads from PostgreSQL, publishes to SQS, calls payment-service"]
Past causes: [KNOWN ROOT CAUSES IF ANY]
Structure the runbook as:
## Severity Assessment (first 2 minutes)
- Is this user-facing? How to check.
- Severity: P1 / P2 / P3 criteria
## Diagnostic Steps (in order)
- Each step: what to check, exact command or dashboard to use
## Likely Causes and Fixes
- Each cause: evidence pattern, fix action, expected time to resolve
## Escalation
- When to page the team lead or service owner
## Communication
- Status page update template
- Stakeholder notification template
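The prompt above can be rendered mechanically from structured alert metadata, so every new alert gets a consistent prompt without hand-editing. A minimal Python sketch, where the field names, the abbreviated prompt body, and the `render_runbook_prompt` helper are illustrative assumptions, not a real tool:

```python
# Sketch: render the runbook prompt from structured alert metadata.
# Only the header fields are shown; the "Structure the runbook as" section
# would be appended verbatim in practice.

RUNBOOK_PROMPT = """Write a runbook for the following alert scenario.
The audience is an on-call engineer who may not be familiar with this service.

Alert: {alert}
Service: {service}
Dependencies: {dependencies}
Past causes: {past_causes}
"""

def render_runbook_prompt(alert: dict) -> str:
    """Fill the prompt template; fallbacks keep empty fields visible."""
    return RUNBOOK_PROMPT.format(
        alert=alert.get("alert", "unknown"),
        service=alert.get("service", "unknown"),
        dependencies=", ".join(alert.get("dependencies", [])) or "unknown",
        past_causes="; ".join(alert.get("past_causes", [])) or "none recorded yet",
    )

prompt = render_runbook_prompt({
    "alert": "HighMemoryUsage: memory > 85% for 10 minutes",
    "service": "checkout-api: handles cart and payment flows",
    "dependencies": ["PostgreSQL", "SQS", "payment-service"],
})
print(prompt)
```

Keeping alert metadata as data rather than prose also makes it easy to update the "past causes" field after each real incident.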
After your next real incident, update the "past causes" section. A runbook that reflects actual incidents is far more useful than one that only reflects theory.
Blameless Post-Mortems
AI drafts the narrative structure. You must verify the timeline and own the action items — AI will fill gaps in your notes with plausible-sounding content that may be wrong.
Write a blameless post-mortem for this incident. Do not assign blame to individuals — describe system and process failures only.
Incident title: [ONE-LINE DESCRIPTION]
Duration: [START TIME] to [END TIME] ([TOTAL DURATION])
Severity: [P1 / P2 / P3]
Customer impact: [DESCRIBE WHO WAS AFFECTED AND HOW]
Timeline (paste raw notes, Slack exports, or alert history — do not clean them up):
[PASTE RAW NOTES]
What we believe was the root cause: [YOUR CURRENT HYPOTHESIS]
Output structure:
1. Summary (3 sentences max)
2. Timeline (clean chronological format from raw notes)
3. Root Cause (technical explanation)
4. Contributing Factors (systemic factors that made this possible or worse)
5. What Went Well (things that limited impact)
6. Action Items (each must have: specific task / owner role / deadline / what it prevents)
For action items: be specific. "Add health check to payment-service before calling it" not "improve resilience".
The "What Went Well" section is often skipped but is important — it identifies practices worth reinforcing.
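The action-item rule above (specific task, owner role, deadline, what it prevents) can be checked mechanically before a post-mortem is published. A hedged sketch; the field names and the vagueness heuristic are assumptions for illustration, not a standard:

```python
# Sketch: lint AI-drafted action items for the required structure.
import re

REQUIRED_FIELDS = ("task", "owner", "deadline", "prevents")

def lint_action_item(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not item.get(f)]
    # A lone vague verb with no concrete object ("improve resilience") is a red flag.
    if re.fullmatch(r"(improve|enhance|investigate)\s+\w+",
                    item.get("task", "").strip(), re.IGNORECASE):
        problems.append("task is too vague to verify")
    return problems

good = {"task": "Add health check to payment-service before calling it",
        "owner": "payments on-call",
        "deadline": "two weeks after the post-mortem review",
        "prevents": "cascading timeouts when payment-service is down"}
bad = {"task": "improve resilience", "owner": "", "deadline": "", "prevents": ""}

assert lint_action_item(good) == []
assert "task is too vague to verify" in lint_action_item(bad)
```

A check like this catches the most common failure mode of AI-drafted action items: fluent wording with no owner and no verifiable outcome.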
Incident Communication Templates
Have AI draft these templates before incidents, not during them; once an incident is live, you should only be filling in the blanks.
Write three incident communication templates for a [DESCRIBE YOUR PRODUCT — e.g., "B2B SaaS API platform"]:
1. INITIAL NOTICE: sent within 5 minutes of declaring an incident. Should acknowledge the issue, state what we know and do not know, and set next-update time. Max 4 sentences.
2. UPDATE: sent every 30 minutes during an incident. Should show progress, state current hypothesis, and set next-update time. Uses [PLACEHOLDER] format for fields to fill in.
3. RESOLUTION: sent when service is restored. Should confirm resolution, summarise impact duration, and commit to post-mortem sharing. Max 5 sentences.
Tone: professional, direct, no corporate jargon. Acknowledge impact to customers plainly.
Store these templates somewhere accessible during incidents — a pinned Slack message, runbook wiki, or your on-call tool.
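One lightweight way to keep a template fillable under pressure is `string.Template`-style placeholders, which map directly onto the [PLACEHOLDER] convention above. A sketch with illustrative wording and field names:

```python
# Sketch: a pre-written UPDATE template filled in seconds during an incident.
# string.Template's $placeholders survive copy-paste into chat tools better
# than code-only formats like f-strings.
from string import Template

UPDATE = Template(
    "We are continuing to investigate $issue. "
    "Current hypothesis: $hypothesis. "
    "Impact: $impact. Next update by $next_update UTC."
)

message = UPDATE.substitute(
    issue="elevated API error rates",
    hypothesis="a bad deploy to the rate limiter",
    impact="around 5% of API requests returning 503",
    next_update="14:30",
)
print(message)
```

`substitute` raises `KeyError` if a placeholder is left unfilled, which is useful here: it stops a half-completed update going out to customers.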
Severity Classification Help
Help me define a severity classification framework for a [DESCRIBE COMPANY SIZE AND PRODUCT TYPE — e.g., "50-person B2B SaaS company with enterprise customers and an uptime SLA"].
Create a severity table with P1–P4 levels, defining for each:
- Customer impact threshold (who is affected and how severely)
- Response time target (time to first responder acknowledgement)
- Resolution target
- Who must be notified (on-call, team lead, exec, customer success)
- Whether a post-mortem is required
Also give me 2 example scenarios per severity level so the classification is unambiguous in practice.
Example scenarios per severity level are essential — abstract definitions always leave room for disagreement during incidents.
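Once agreed, the severity table can also live as code, so tooling (paging, notification routing) applies the same definitions as the wiki. A minimal sketch; every threshold and notification list here is an illustrative assumption, not a recommended policy:

```python
# Sketch: a severity table as data plus a toy classifier.
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    level: str
    ack_target_min: int        # time to first responder acknowledgement
    postmortem_required: bool
    notify: tuple              # who must be notified

SEVERITIES = {
    "P1": Severity("P1", 5, True, ("on-call", "team lead", "exec", "customer success")),
    "P2": Severity("P2", 15, True, ("on-call", "team lead")),
    "P3": Severity("P3", 60, False, ("on-call",)),
    "P4": Severity("P4", 240, False, ("on-call",)),
}

def classify(user_facing: bool, pct_customers_affected: float) -> Severity:
    """Toy classifier: tune thresholds to your own SLA and customer base."""
    if user_facing and pct_customers_affected >= 10:
        return SEVERITIES["P1"]
    if user_facing:
        return SEVERITIES["P2"]
    if pct_customers_affected > 0:
        return SEVERITIES["P3"]
    return SEVERITIES["P4"]

assert classify(True, 40.0).level == "P1"
assert classify(False, 0.0).postmortem_required is False
```

Encoding the example scenarios as test cases against `classify` is one way to keep the "unambiguous in practice" promise honest over time.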
Checklist: Do You Understand This?
- Why is AI generally more useful before and after incidents than during them?
- What is the one situation where consulting AI during an active incident is justified?
- What is the most important thing to verify when AI generates a post-mortem from your raw notes?
- What makes an action item "specific" in a post-mortem, and why does this matter?
- Write a severity classification prompt for a consumer app with no SLA but high daily active users.
- Why should incident communication templates be written before incidents, not during them?