Post-Incident Review
A post-incident review (PIR) — sometimes called a post-mortem or retrospective — is a structured analysis conducted after an AI incident is resolved. Its purpose is not to assign blame, but to understand what happened, why it happened, and what must change so it does not happen again. For AI systems, the PIR must account for causes that are inherently probabilistic, distributed across training-time and deployment-time decisions, and often the result of systemic gaps rather than individual errors.
Blameless Culture in AI Incidents
A blameless postmortem culture, popularised by Google SRE, is even more important for AI incidents than for conventional software incidents. This is because AI failures frequently arise from:
- Complex training pipelines with many contributors, none of whom individually caused the failure
- Statistical behaviour that no single engineer could have predicted from code review alone
- Data labelling or collection decisions made months before the incident by people not involved in deployment
- Organisational decisions (cut scope of evaluation, skip bias testing to meet deadline) that look reasonable at the time but contributed to the failure
Blameless does not mean consequence-free. If policies were deliberately violated or reasonable precautions were knowingly skipped, accountability is appropriate. Blameless means the review focuses on systems, processes, and decisions — not on individual people as the root cause.
Five-Whys Adapted for AI Failures
The five-whys technique — asking "why?" repeatedly until a root cause is reached — must be adapted for AI systems to avoid stopping prematurely at proximate technical causes.
| Why level | Proximate answer | AI-adapted deeper question |
|---|---|---|
| Why 1 | The model produced incorrect outputs | What type of failure? Hallucination, drift, bias, adversarial manipulation? |
| Why 2 | The model was not validated on this input type | Was this input type foreseeable? Was it in scope? Did we have test data for it? |
| Why 3 | The evaluation process did not include this scenario | Why not? Was it an oversight, a resource constraint, or a process gap? |
| Why 4 | There was no checklist item requiring evaluation coverage of edge cases | Is this a single checklist gap or a systematic gap in the evaluation process? |
| Why 5 | The evaluation process was not reviewed against the AI risk taxonomy | Root cause: the AI governance framework did not connect risk taxonomy to evaluation requirements |
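The five-whys chain above can be captured as structured data rather than free text, which makes the root cause machine-readable for the incident register. The sketch below is a hypothetical structure, not a standard schema; the class and field names are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class WhyStep:
    """One level of a five-whys chain."""
    question: str
    answer: str
    is_root_cause: bool = False

@dataclass
class FiveWhysChain:
    """Hypothetical record of a five-whys analysis for one incident."""
    incident_id: str
    steps: list = field(default_factory=list)

    def add(self, question, answer, is_root_cause=False):
        self.steps.append(WhyStep(question, answer, is_root_cause))

    def root_cause(self):
        # The chain is incomplete until some step is marked as the root cause.
        for step in self.steps:
            if step.is_root_cause:
                return step.answer
        return None

# The example chain from the table above (incident ID is illustrative).
chain = FiveWhysChain("INC-2024-017")
chain.add("Why did the incident occur?",
          "The model produced incorrect outputs")
chain.add("Why were the outputs incorrect?",
          "The model was not validated on this input type")
chain.add("Why was it not validated?",
          "The evaluation process did not include this scenario")
chain.add("Why was the scenario missing?",
          "No checklist item required edge-case evaluation coverage")
chain.add("Why was there no such requirement?",
          "The governance framework did not connect risk taxonomy "
          "to evaluation requirements",
          is_root_cause=True)
```

Storing chains this way lets later reviews query which incidents terminated at the same root cause.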
Systemic vs One-Off Causes
Every PIR must determine whether the cause was systemic (affects the full AI development lifecycle and will produce more incidents if not addressed) or one-off (an isolated circumstance unlikely to recur). Getting this wrong in either direction is costly: treating systemic causes as one-offs means the incident recurs; treating one-offs as systemic triggers expensive process overhaul that provides no benefit.
Indicators of a systemic cause
- Similar incidents have occurred before (check the incident register)
- The root cause is a process, policy, or tooling gap that affects all AI systems — not just this one
- The failure would have occurred with any model trained under the same process
- Multiple teams reported the same type of near-miss in the past 12 months
Indicators of a one-off cause
- The failure required a rare combination of circumstances that are unlikely to recur
- It was caused by a specific external event (a sudden change in user behaviour, a third-party data outage)
- No similar incidents in the history of this or comparable systems
- The root cause was already addressed as part of remediation with no further process changes needed
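The systemic-vs-one-off indicators above can be encoded as a simple decision heuristic. This is a minimal sketch under the assumption that any single systemic indicator is enough to treat the cause as systemic; the function name, parameters, and thresholds are illustrative, not a prescribed rule.

```python
def classify_cause(prior_similar_incidents: int,
                   process_level_gap: bool,
                   affects_all_systems: bool,
                   near_misses_last_12mo: int) -> str:
    """Heuristic sketch: classify a root cause as systemic or one-off
    using the indicators listed above. Parameters are assumptions."""
    systemic_signals = [
        prior_similar_incidents > 0,   # incident register shows recurrence
        process_level_gap,             # any model trained this way would fail
        affects_all_systems,           # process/policy/tooling gap, not model-specific
        near_misses_last_12mo >= 2,    # multiple teams saw the same near-miss
    ]
    return "systemic" if any(systemic_signals) else "one-off"
```

Erring toward "systemic" when signals conflict is the cheaper mistake only if process overhauls are scoped narrowly; the document's warning about over-reaction still applies.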
PIR Process and Output
PIR meeting participants
- Model owner (required)
- On-call responders who handled the incident
- Representatives from teams whose systems or decisions contributed to the failure
- AI governance function (required for P1/P2 incidents)
- Legal/compliance if regulatory obligations were triggered
- A PIR facilitator who was not directly involved in the incident — to maintain objectivity
PIR document structure
- Incident summary: What happened, when, who was affected, severity classification
- Timeline: Detailed sequence of events from first signal to resolution
- Impact assessment: Quantified harm, including the number of affected individuals, financial impact, regulatory exposure, and reputational damage
- Root cause analysis: Five-whys or equivalent; contributing factors; systemic vs one-off determination
- What went well: Detection mechanisms, response actions, and decisions that worked as intended
- Action items: Specific, owned, time-bound improvements — see below
- Sign-off: AI risk owner and governance function confirm the PIR is complete and action items are tracked
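A sign-off gate can mechanically verify that every required section of the PIR document is present before the risk owner approves it. The sketch below assumes the PIR is held as a dictionary keyed by section; the field names mirror the structure above but are otherwise illustrative.

```python
# Required PIR sections, per the document structure above.
REQUIRED_SECTIONS = [
    "incident_summary", "timeline", "impact_assessment",
    "root_cause_analysis", "what_went_well", "action_items", "sign_off",
]

def missing_sections(pir: dict) -> list:
    """Return the missing or empty sections; an empty list means the
    PIR is structurally ready for sign-off. Keys are illustrative."""
    return [s for s in REQUIRED_SECTIONS if not pir.get(s)]
```

This checks only presence, not quality; the facilitator and governance function still review content.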
Action Items: Turning PIRs into Governance Improvements
A PIR that identifies systemic causes but does not produce binding action items is a waste of time. Every systemic root cause must map to at least one action item with:
| Field | Content |
|---|---|
| Action | Specific, observable change — not "improve bias testing" but "add demographic parity gate to CI/CD pipeline for all classification models by Q3" |
| Owner | Named individual, not a team. One person is accountable. |
| Due date | Specific date. P1 systemic fixes: 30 days. P2: 60 days. Others: 90 days default. |
| Verification | How will completion be verified? Code review merged? New test suite passing? Policy document reviewed and signed off? |
| Linked incidents | If this action item addresses the root cause of multiple incidents, link them all — useful for tracking whether the fix actually worked |
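The action-item fields above lend themselves to automated validation when items are logged. The sketch below encodes the table's rules (specific action, named individual owner, severity-based due-date windows, mandatory verification); the vagueness check and all parameter names are assumptions for illustration, not a real tracker's API.

```python
from datetime import date, timedelta

# Due-date windows from the table above: days allowed per severity.
SLA_DAYS = {"P1": 30, "P2": 60}
DEFAULT_SLA_DAYS = 90

def validate_action_item(action: str, owner: str, due_date: date,
                         opened_on: date, severity: str,
                         verification: str) -> list:
    """Minimal sketch of checks an action-item tracker might enforce.
    Returns a list of problems; empty means the item is acceptable."""
    errors = []
    # Crude vagueness check, illustrative only: bare "improve X" is not
    # a specific, observable change.
    if not action or action.lower().startswith("improve"):
        errors.append("action must be a specific, observable change")
    if not owner or " team" in owner.lower():
        errors.append("owner must be a named individual, not a team")
    allowed = timedelta(days=SLA_DAYS.get(severity, DEFAULT_SLA_DAYS))
    if due_date > opened_on + allowed:
        errors.append(f"due date exceeds {allowed.days}-day window for {severity}")
    if not verification:
        errors.append("verification criterion is required")
    return errors
```

A real tracker would also link the item to its incidents so recurrence checks can confirm the fix worked.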
Tracking Recurrence
The ultimate measure of a PIR's effectiveness is whether the same or similar incident recurs. Track this formally:
- Tag incidents with root cause categories in the incident register — allows pattern detection across incidents over time
- At each PIR, check: have we had incidents with the same root cause before? If yes, the previous PIR's action items either were not completed or were not effective
- 30/60/90 day follow-up reviews on P1 and P2 action items — confirm implementation was completed and that early metrics show the fix is working
- Annual governance review: examine the incident register for patterns — which root cause categories account for most incidents? Prioritise systemic improvements accordingly.
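The recurrence checks above reduce to a frequency count over root-cause tags in the incident register. The sketch below assumes the register is a list of dictionaries with `id` and `root_cause_category` fields; that format, and the category names, are assumptions for illustration.

```python
from collections import Counter

def recurrence_report(incident_register: list) -> dict:
    """Count incidents per root-cause category. Categories appearing
    more than once suggest a previous PIR's action items were not
    completed or were not effective."""
    counts = Counter(i["root_cause_category"] for i in incident_register)
    return {cat: n for cat, n in counts.most_common() if n > 1}

# Illustrative register: 'eval-coverage-gap' recurs, so the earlier
# fix for that category needs re-examination.
register = [
    {"id": "INC-01", "root_cause_category": "eval-coverage-gap"},
    {"id": "INC-02", "root_cause_category": "data-drift"},
    {"id": "INC-03", "root_cause_category": "eval-coverage-gap"},
]
```

Running this at each PIR and annually gives the pattern view the governance review needs for prioritising systemic improvements.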
Checklist: Do You Understand This?
- What does "blameless" mean in the context of an AI postmortem, and why is it particularly important for AI incidents?
- Apply the five-whys technique to the scenario: "the model denied a loan to a qualified applicant from a protected group."
- What indicators suggest a root cause is systemic rather than one-off?
- What must every action item in a PIR include to be effective?
- How do you determine whether a previous PIR's remediation actually worked?