How to write a postmortem (without throwing anyone under the bus)

A postmortem is the document a team writes after a production incident: what happened, why, and what we'll change so it doesn't happen again. Postmortems are standard practice on every well-run engineering team, and they're often the highest-leverage document an organization produces. Most are written badly. Here's the version that actually works, including the tone that keeps it useful instead of political.

What a postmortem actually is

The setup: something broke in production. The site went down, or a bug shipped that affected users, or data got corrupted. The team responded, and the issue is now resolved. Now you write a postmortem.

The goal of the document is not to assign blame. The goal is for the team to walk away with one or two concrete changes that prevent the same class of incident in the future. If you can't point to those concrete changes at the end, the postmortem failed regardless of how detailed the writeup is.

Postmortems are sometimes called "incident reports," "RCAs" (root cause analyses), or "after-action reports." Same idea, different industries.

"Blameless" is not "blame-free"

Modern engineering postmortems are explicitly blameless. That doesn't mean nobody is responsible; it means we treat the incident as the result of system failure, not individual failure. People made decisions that seemed reasonable given the information they had. The system allowed those decisions to cause an incident. The fix is in the system, not in disciplining the human.

The blameless framing is practical, not just kind. If engineers fear punishment for incidents, they hide problems instead of reporting them. The team gets less information and more incidents. Companies that do blameless well (Google SRE famously) use it because it produces better outcomes, not because it feels nicer.

What blameless looks like in writing:

Blaming: "Alex deployed an untested config change and broke checkout."
Blameless: "A config change reached production without staging validation and exhausted the connection pool."

Same fact, different lens. The first version blames Alex. The second names a system gap that's actionable.

The structure

A standard postmortem has these sections, in order. You can rearrange or rename slightly, but every section should appear.

1. Summary (one paragraph)

What broke, when, who was affected, how long. Three to five sentences. This is all that 80% of readers will ever read.

Example:

On April 28 from 14:32 to 15:18 PDT (46 minutes), checkout was unavailable for users in the EU region due to a misconfigured database connection pool. Users could browse but could not complete purchases. Estimated revenue impact: $12,000. The issue was resolved by reverting the config change.

2. Impact

Quantified consequences. Numbers, not adjectives: how many users, how many failed requests, how much revenue, how many minutes.

3. Timeline

The minute-by-minute account of the incident. Use UTC or a single time zone consistently. Each event is one line.

14:32 - Config change deployed (commit abc123)
14:34 - Error rate begins climbing in EU region (Datadog dashboard)
14:36 - PagerDuty alert fires for backend-prod-eu
14:38 - On-call (Sarah) acknowledges page, starts investigation
14:42 - Sarah identifies elevated DB connection errors
14:47 - Sarah escalates to DB team in #incident-response
14:51 - DB team identifies the config change as the suspected cause
14:55 - Decision to revert the change (incident commander: Sarah)
14:58 - Revert deployed (commit def456)
15:08 - Error rate returns to baseline
15:18 - Status page updated to "all clear"

The timeline is the spine of the document. Most insights come from looking at the gaps: "why did it take 4 minutes from deploy to alert? Should the threshold be tighter?"
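Those gaps are easy to eyeball in a short timeline and easy to miss in a long one. A minimal sketch of scripting the arithmetic, assuming the "HH:MM - event" line format used above (the event text here is abbreviated for illustration):

```python
from datetime import datetime

# Timeline lines in the "HH:MM - event" format used above (abbreviated).
timeline = [
    "14:32 - Config change deployed",
    "14:36 - PagerDuty alert fires",
    "14:38 - On-call acknowledges page",
    "14:58 - Revert deployed",
    "15:08 - Error rate returns to baseline",
]

def gaps(lines):
    """Yield (minutes, from_event, to_event) for each consecutive pair."""
    events = []
    for line in lines:
        stamp, _, event = line.partition(" - ")
        events.append((datetime.strptime(stamp, "%H:%M"), event))
    for (t0, e0), (t1, e1) in zip(events, events[1:]):
        yield int((t1 - t0).total_seconds() // 60), e0, e1

for minutes, e0, e1 in gaps(timeline):
    print(f"{minutes:3d} min  {e0} -> {e1}")
```

The output makes the 20-minute stretch between acknowledgment and revert jump out, which is exactly the kind of gap worth a question in the review meeting.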

4. Root cause

The technical explanation. What actually broke. Be specific. Show code, config, or queries where helpful.

Example:

A config change to db.poolSize was deployed at 14:32, reducing the connection pool from 50 to 5. This was a typo (intended value was 50, but a 0 was dropped). The application began rejecting requests within 90 seconds as the small pool exhausted under normal load.
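In config terms, the typo might have looked like this (a hypothetical db.yaml; the incident doesn't specify the real file name or schema):

```yaml
# db.yaml (hypothetical file name and layout, for illustration only)
db:
  poolSize: 5     # what shipped: the dropped-zero typo
  # poolSize: 50  # what was intended
```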

5. Contributing factors

The conditions that allowed the root cause to result in an incident. This section is where most learning happens.

Each factor is a system gap that, if absent, would have prevented or shortened the incident. Phrase them as observations, not accusations. For the incident above: config changes did not pass through staging, nothing validated pool-size values, and no alert fired on pool saturation.

You're naming the holes in the Swiss cheese, not who lined them up.

6. What went well

Often skipped, always worth including. Calls out the parts of the incident response that worked, so the team knows what to keep doing.

7. Action items

The most important section. A short list of specific, owned, scheduled changes that come out of this incident.

Each action item must be:

Specific: a testable system change, not an intention.
Owned: exactly one person or team responsible.
Scheduled: a due date, tracked in the team's backlog.

Example:

| #  | Action item                                            | Owner    | Due       |
|----|--------------------------------------------------------|----------|-----------|
| 1  | Add CI lint rule: db.poolSize must be in range 10-200  | Sarah    | May 5     |
| 2  | Add alert on connection pool saturation > 80%          | DB team  | May 12    |
| 3  | Add "check recent config deploys" to oncall runbook    | Sarah    | May 5     |
| 4  | Require staging deploy + 10min soak for config changes | Platform | May 20    |

Three to five action items is right. Twenty action items is a wishlist that won't get done. Pick the highest-leverage three.
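Action item 1 can be as small as a check that runs in CI. A sketch under assumptions: the config is YAML-like text with a `poolSize: <int>` line, and a regex stands in for the config format's real parser (the function name and regex here are hypothetical, not from the incident):

```python
import re

# Hypothetical CI lint for action item 1: db.poolSize must be in 10-200.
# Assumes a "poolSize: <int>" line in YAML-like config text; a real
# check would use the config format's actual parser instead of a regex.
POOL_SIZE_RE = re.compile(r"^\s*poolSize:\s*(\d+)\s*$", re.MULTILINE)

def check_pool_size(config_text, low=10, high=200):
    """Return a list of lint errors (an empty list means the config passes)."""
    errors = []
    values = POOL_SIZE_RE.findall(config_text)
    if not values:
        errors.append("poolSize not set")
    for value in values:
        n = int(value)
        if not low <= n <= high:
            errors.append(f"poolSize={n} outside allowed range {low}-{high}")
    return errors

# The typo'd config from the incident fails the check:
print(check_pool_size("db:\n  poolSize: 5\n"))
# A sane value passes (empty list):
print(check_pool_size("db:\n  poolSize: 50\n"))
```

Wired into CI as a required check, this turns the dropped-zero typo from a 46-minute outage into a failed build.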

8. Lessons learned (optional)

If there's a broader lesson worth surfacing beyond the action items, name it. "We don't currently treat config as code with the same rigor as application code" is a lesson. The action items above are concrete steps; the lesson is the pattern that should change.

The tone that works

Write the postmortem like you're explaining to a thoughtful new engineer who joins the team next month. They weren't there. They have no context. They want to learn. Three rules:

1. Use passive voice for actions, active voice for fixes

Passive voice is usually bad writing, except in postmortems where it depersonalizes the cause. "A config change was deployed" is appropriate. "Sarah deployed a config change" shifts the tone toward blame even if not intended.

For the fixes and action items, use active voice: there the agency belongs to the team, and owning it is the point.

2. Distinguish facts from interpretations

Fact: "Error rate climbed from 0.1% to 47% over 90 seconds."

Interpretation: "This caused user frustration."

Both can appear in a postmortem, but mark which is which. Reviewers can argue interpretations; they can't argue facts. The split forces clarity.

3. Don't speculate

If you don't know, say so. "It's unclear why the alert took 4 minutes to fire; this is being investigated separately." Speculation about causes you can't confirm makes the document less trustworthy and lowers the impact of the parts you can confirm.

Practice the workflow that keeps things from breaking

InternQuest's missions train the muscle of small, careful changes that survive code review: conventional commits, focused scope, automated PR feedback, all inside a free virtual SWE internship simulator. Because the best postmortem is the one you didn't have to write.

Try a mission →

Common postmortem failure modes

Naming the engineer in the timeline

Even with blameless intent, you'll often see "Alex deployed the change at 14:32." This implicitly blames Alex even when nobody intends to. Name roles instead: "the on-call engineer," "the deployer," "the DB team."

Exception: when crediting a quick action ("Sarah identified the cause within 4 minutes"), naming is fine. The asymmetry is intentional: people get credit, systems get blamed.

Action items that aren't actions

"Be more careful with config changes." Not an action item. Nothing testable, nothing scheduled. The fix should be a system change: a lint rule, an alert, a process change.

Multiple "root causes"

You usually have one technical root cause and several contributing factors. Lumping them together as "root causes" muddies the analysis. Pick the one thing that, if absent, would have prevented the incident, and call it the root cause. Everything else goes in contributing factors.

The five whys gone wild

The "five whys" technique (ask "why?" five times to drill into root cause) is useful but easy to misuse. By the fifth why you can usually reach "because the company exists" or some equally unactionable abstraction. Stop when you hit something the team can actually change.
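For the pool-size incident, a chain that stops at the right depth might look like:

```
Why did checkout fail?         The connection pool was exhausted.
Why was it exhausted?          poolSize shipped as 5 instead of 50.
Why did the typo reach prod?   Nothing validated config values.   <- actionable: stop here
Why was there no validation?   "Config isn't treated as code"     <- a lesson, not a fifth why
```

The third why yields a CI check the team can ship next week; the fourth is already drifting toward the unactionable.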

Postmortem-as-performance-review

If you're using a postmortem to make the case that someone should be fired or formally reprimanded, you're using the wrong document. That's an HR conversation that happens separately, with HR. The postmortem is for systems learning, in public to the team.

No follow-through

The single biggest postmortem failure is writing one and then never doing the action items. Track them in your team's backlog with deadlines. Review at the next retro. If your team writes 10 postmortems a year and completes 30% of action items, you're roughly tied with most of the industry, and improving that number is the cheapest reliability investment you can make.

An intern's role in postmortems

You probably won't lead one. You might attend the postmortem meeting, where the team reviews the draft together. Useful things to do as an intern:

Ask the "obvious" questions; they often surface assumptions nobody wrote down.
Volunteer for a small, well-scoped action item.
Read a few past postmortems to learn how the team writes them.

The minimum viable postmortem

If your team doesn't have a template, the absolute minimum is:

  1. What happened (one paragraph)
  2. Timeline (5 to 15 lines)
  3. Root cause (one paragraph with technical detail)
  4. Action items (3 to 5, owned, dated)

Even that bare-bones version is enormously valuable if a team had nothing before. Don't let perfect be the enemy of "we wrote it down."
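As a copy-paste starting point (the section names come from the minimum list above; adjust to taste):

```markdown
# Postmortem: <one-line incident name> (<date>)

## What happened
One paragraph: what broke, when, who was affected, how long.

## Timeline
14:32 - ...
14:36 - ...

## Root cause
One paragraph with technical detail.

## Action items
| # | Action item | Owner | Due |
|---|-------------|-------|-----|
| 1 |             |       |     |
```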

Where to publish them

Internal-only is the default. Some teams maintain a public postmortem culture (Stripe, GitLab, and Cloudflare have famous ones). The choice depends on the company's openness norms; don't make it unilaterally as an intern.

Internal postmortems should be readable across the company, not just the team. The whole point is that another team running into the same class of issue can find the lesson.

The shift in mindset

An incident isn't a failure of an engineer. It's the team learning something the system didn't already know. The postmortem is the artifact of that learning. Done well, it makes the next year better. Done badly, it just produces a document nobody reads and a quietly traumatized engineer.

Write them blamelessly. Make them concrete. Track the action items. Repeat. The teams that get this right tend to be the same teams that ship reliably and grow engineers fast. Both come from the same root: treating mistakes as data instead of failure.

Build the workflow muscle that prevents most incidents

The best postmortems are the ones never needed because conventional commits, careful PRs, and code review caught the issue first. InternQuest is a free virtual software engineering internship simulator that drills exactly that loop on real broken code.

Try a mission →