Postmortems That Change Behavior
We had a P1 incident in March 2023 that took down our payment processing for 47 minutes. We wrote a thorough postmortem. We identified five action items. We assigned owners. We closed the incident.
In September 2023, we had an incident caused by the same class of problem. When I pulled up the March postmortem and looked at the five action items, three of them had been marked "done" with no verification, one was still open after six months, and one had been silently abandoned. We had done the ritual of the postmortem without producing any of the actual behavioral change it's supposed to create.
This is the normal state of postmortem culture at most engineering organizations. The documents accumulate. The behaviors don't change.
What "Blameless" Actually Means (and Doesn't Mean)
Blameless postmortems are a well-established concept that gets misapplied in two opposite directions.
The first misapplication is false blamelessness — softening language to the point where root causes get obscured. "The deployment process had room for improvement" when the truth is "the deployment script silently ignored exit codes and we shipped broken code." False blamelessness feels psychologically safe but produces action items that are too vague to act on.
The second misapplication is weaponized blamelessness — using the framework as a shield to avoid accountability for outcomes by nominally holding only systems accountable. "The system created the conditions" can be true and still be used to avoid the harder conversation about why a known fragile system was left unaddressed for two quarters.
True blamelessness means: we don't attribute malicious intent or personal failing, we do attribute specific decisions and actions to specific people (without shame), and we distinguish between decisions that were reasonable given available information and decisions that were made without adequate information that should have been available.
A useful test: would the person named in the postmortem feel the document is fair? Not comfortable — fair. If they'd feel it's unfair, it's probably wrong. If they'd feel comfortable, it's probably too soft.
The Template Anatomy That Works
Most postmortem templates are too long. Engineers copy-paste them, fill in the obvious sections, and skip the hard ones because the template doesn't force them to be specific. Here's the section structure that produces usable documents:
```markdown
## Incident Summary

- Date, duration, severity, services affected
- One sentence: what broke, what it affected, how it was resolved

## Timeline

Chronological facts only. No interpretation here.

- HH:MM UTC — [Event]
- HH:MM UTC — [Event]

## Impact

Quantified where possible.

- Users affected: N (measured how?)
- Revenue impact: $X (estimated how?)
- Data impact: none / reversible / permanent

## Root Cause

The technical condition that made the incident possible.
One paragraph. Specific. No "miscommunication" as a root cause.

## Contributing Factors

Conditions that amplified impact or delayed detection.
Bullet list. Each factor should be actionable.

## What Went Well

Honest. Not performative. What actually helped?

## Action Items

| Item | Owner | Due Date   | Verification Method      |
|------|-------|------------|--------------------------|
| ...  | @name | YYYY-MM-DD | How we confirm it's done |

## Follow-up Review Date

Date when this postmortem's action items will be reviewed in team retro.
```

The single most important addition most templates lack: a Verification Method for every action item. "Add a health check to the deployment pipeline" is not done when the PR is merged. It's done when the health check catches its first real failure (or when you've simulated a failure and confirmed it catches it). Writing the verification method at the time of the postmortem forces the specificity that prevents the "marked done, never verified" failure mode.
Action Items with Owners: The Specificity Requirement
"Improve alerting" is not an action item. "Add a p99 latency alert on the checkout service with a 2-second threshold that pages the on-call" is an action item. The specificity requirement sounds obvious but requires active enforcement in the moment.
The key questions to drive action item specificity:
What exactly will change? If you can't describe the changed state in one sentence, the item isn't specific enough.
Who specifically owns it? Team names are not owners. @alice owns it, with @bob as backup if @alice is unavailable. One human being.
When is it due? "Soon" and "next sprint" are not due dates. A calendar date is a due date. The date should be aggressive: most postmortem action items should be completable within two weeks. Items that take longer are architectural changes and should be tracked as projects, not postmortem items.
How will completion be verified? Describe the specific test, metric, or artifact that constitutes evidence of completion. "PR merged" is acceptable only for documentation changes. For code changes, the verification is usually "we can demonstrate that this incident would have been detected/prevented by running the same scenario."
Here's the schema behind the action item table we keep in Notion, which our incident bot auto-creates:

```typescript
// Aliases assumed here so the snippet stands alone; in practice these
// are plain strings with the stated formats.
type SlackUserId = string;
type ISODateString = string; // YYYY-MM-DD
type GitHubPRUrl = string;

// Incident bot action item schema
interface PostmortemActionItem {
  id: string;
  incidentId: string;
  description: string;        // Specific, one sentence
  owner: SlackUserId;
  backupOwner?: SlackUserId;
  dueDate: ISODateString;
  verificationMethod: string; // How completion is confirmed
  status: "open" | "in-progress" | "complete" | "deferred";
  completedAt?: ISODateString;
  verifiedBy?: SlackUserId;   // Different person than owner
  linkedPR?: GitHubPRUrl;
}
```

Note that verifiedBy is a separate person from owner. Self-verification defeats the purpose.
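To make that rule concrete, here is a minimal sketch of a guard the bot could run before accepting a "complete" status. The function is illustrative, not our actual bot code; it assumes only the PostmortemActionItem shape above.

```typescript
// Hypothetical guard: reject a "complete" transition that skips real verification.
function assertVerifiedCompletion(item: PostmortemActionItem): void {
  if (item.status !== "complete") return;
  if (!item.verifiedBy) {
    throw new Error(`${item.id}: cannot be marked complete without a verifier`);
  }
  if (item.verifiedBy === item.owner) {
    throw new Error(`${item.id}: the owner cannot verify their own item`);
  }
  if (item.verificationMethod.trim().length === 0) {
    throw new Error(`${item.id}: write down how completion was verified`);
  }
}
```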
The Retro on the Retro
Every quarter, we run a "meta-retrospective" on our postmortem process itself. This sounds like a parody of process-culture, and the first time I proposed it I got exactly that response. But it's produced the most durable improvements to our incident management culture.
The meta-retro reviews a sample of the last quarter's postmortems and asks:
- What percentage of action items were completed on time? (We track this automatically.)
- What percentage of completed action items were actually verified?
- Are there recurring root cause patterns across incidents? (Same root cause appearing in multiple incidents means action items from the first incident weren't effective.)
- Did any incident repeat a root cause from a previous postmortem?
That last question is the uncomfortable one. If the same class of problem appears twice, either the action items were wrong (they addressed symptoms, not the root cause) or they weren't completed. Each failure mode calls for a different response.
We plot the quarter's action items on a quadrant chart, with specificity on one axis and completion on the other. The pattern tells you where to focus process improvement. Items in the bottom-left (vague, abandoned) indicate a template problem. Items in the bottom-right (specific, not completed) indicate a prioritization or capacity problem that needs engineering management attention.
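The percentages and the quadrant counts fall out of the same records. A minimal sketch of the rollup, assuming the PostmortemActionItem shape above plus one assumed field, isSpecific, which a reviewer sets by hand during the meta-retro:

```typescript
// Hypothetical meta-retro rollup. isSpecific is a reviewer judgment recorded
// during the meta-retro, not a field the incident bot stores.
interface ReviewedItem extends PostmortemActionItem {
  isSpecific: boolean;
}

function metaRetroStats(items: ReviewedItem[]) {
  const completed = items.filter((i) => i.status === "complete");
  // ISO date strings (YYYY-MM-DD) compare correctly as plain strings.
  const onTime = completed.filter(
    (i) => i.completedAt !== undefined && i.completedAt <= i.dueDate
  );
  const verified = completed.filter((i) => i.verifiedBy !== undefined);

  // Quadrant key: specificity (x axis) / completion (y axis).
  const quadrants: Record<string, number> = {};
  for (const i of items) {
    const key = `${i.isSpecific ? "specific" : "vague"}/${
      i.status === "complete" ? "completed" : "not-completed"
    }`;
    quadrants[key] = (quadrants[key] ?? 0) + 1;
  }

  return {
    onTimeRate: onTime.length / Math.max(items.length, 1),
    verifiedRate: verified.length / Math.max(completed.length, 1),
    quadrants, // e.g. { "vague/not-completed": 3, "specific/completed": 9 }
  };
}
```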
Follow-Through Mechanics
The postmortem document is not the product. The behavioral change is the product. Getting to that requires follow-through infrastructure:
Automated reminders: Our incident bot sends a Slack message to the action item owner 3 days before the due date and again on the due date. No manual tracking required. (A minimal sketch of this check appears after the last mechanic below.)
Public visibility: Action items are on the engineering team's shared board. They're not hidden in an incident management tool that only the on-call team sees. When an action item is overdue, it's visible to the whole team.
Monthly incident review in all-hands: We dedicate 10 minutes of our monthly engineering all-hands to incident trends. Not to shame individuals, but to show the pattern: how many incidents this quarter, what categories, which action items closed, which recurring patterns we're still addressing. This keeps incident culture from being purely a reactive, on-call-team concern.
Hard stop on new feature work: When a root cause repeats (incident of the same class appears twice), we enforce a mandatory 1-week moratorium on new feature work for the team that owns the affected system. This is controversial. It's also the only mechanism we've found that creates real urgency around addressing systemic issues rather than treating them as backlog items.
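For concreteness, here is the reminder pass as a minimal sketch. postSlackDM is a stand-in stub for whatever Slack client call a real bot would make; only the PostmortemActionItem shape above is taken from our actual schema.

```typescript
// Stand-in for a real Slack client call.
async function postSlackDM(user: SlackUserId, message: string): Promise<void> {
  console.log(`DM to ${user}: ${message}`);
}

// Whole days from today (UTC) until the due date; negative means overdue.
function daysUntil(dueDate: ISODateString, now: Date = new Date()): number {
  const msPerDay = 24 * 60 * 60 * 1000;
  const due = Date.parse(`${dueDate}T00:00:00Z`);
  const today = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  return Math.round((due - today) / msPerDay);
}

// Daily pass: ping owners 3 days out and again on the due date itself.
async function sendDueDateReminders(items: PostmortemActionItem[]): Promise<void> {
  for (const item of items) {
    if (item.status === "complete" || item.status === "deferred") continue;
    const days = daysUntil(item.dueDate);
    if (days === 3 || days === 0) {
      await postSlackDM(
        item.owner,
        `Action item "${item.description}" is due ${days === 0 ? "today" : "in 3 days"}.`
      );
    }
  }
}
```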
Writing the Document Within 24 Hours
The postmortem should be in draft form within 24 hours of incident resolution. Not polished — drafted. The reason is memory decay: technical details that seem obvious in the moment of the incident become fuzzy within 48 hours. The exact sequence of decisions made during the incident, the specific error messages that appeared, the hypotheses that were considered and ruled out — all of this needs to be captured while it's fresh.
Our process: the incident commander opens the postmortem doc immediately upon declaring the incident resolved and fills in the timeline from the incident's Slack channel (we log all incident communication to a dedicated channel automatically). The draft goes to the team for async review within 24 hours. The synchronous postmortem meeting happens within 72 hours and is capped at 30 minutes.
The 30-minute cap is not a compromise on quality. It's a forcing function. If the meeting is going to run long, the doc wasn't clear enough before the meeting. The prep is where the analysis happens; the meeting is where alignment happens.
Key Takeaways
- Blamelessness is about removing shame, not removing accountability — specific decisions and actions should still be attributed to specific people, described fairly and without malice.
- The most important addition to any postmortem template is a verification method for every action item — "PR merged" is not verification for code changes that need to demonstrably prevent recurrence.
- Action items need a single named human owner, a calendar due date, and a verification method written by someone who isn't the owner.
- A quarterly meta-retrospective on your postmortem process surfaces recurring root cause patterns and tracks follow-through rates — if the same class of incident appears twice, the action items from the first weren't effective.
- Public visibility of action item status (not hidden in an on-call tool) and automated reminders are the two follow-through mechanics with the highest leverage.
- The postmortem document should be drafted within 24 hours while memory is sharp; the synchronous meeting should be capped at 30 minutes and used for alignment, not analysis.