How to write a postmortem (without throwing anyone under the bus)
A postmortem is the document a team writes after a production incident: what happened, why, and what we'll change so it doesn't happen again. They're standard practice on every well-run engineering team, and they're often the highest-leverage document an organization produces. Most are written badly. Here's the version that actually works, including the tone that keeps it useful instead of political.
What a postmortem actually is
The setup: something broke in production. The site went down, or a bug shipped that affected users, or data got corrupted. The team responded, the issue is now resolved. Now you write a postmortem.
The goal of the document is not to assign blame. The goal is for the team to walk away with one or two concrete changes that prevent the same class of incident in the future. If you can't point to those concrete changes at the end, the postmortem failed regardless of how detailed the writeup is.
Postmortems are sometimes called "incident reports," "RCAs" (root cause analyses), or "after-action reports." Same idea, different industries.
"Blameless" is not "blame-free"
Modern engineering postmortems are explicitly blameless. That doesn't mean nobody is responsible; it means we treat the incident as the result of system failure, not individual failure. People made decisions that seemed reasonable given the information they had. The system allowed those decisions to cause an incident. The fix is in the system, not in disciplining the human.
The blameless framing is practical, not just kind. If engineers fear punishment for incidents, they hide problems instead of reporting them. The team gets less information and more incidents. Companies that do blameless well (Google SRE famously) use it because it produces better outcomes, not because it feels nicer.
What blameless looks like in writing:
- Bad: "Alex deployed a config change that crashed the service."
- Good: "A config change was deployed that crashed the service. Our deploy tooling did not catch the misconfiguration because we don't currently lint config changes."
Same fact, different lens. The first version blames Alex. The second names a system gap that's actionable.
The structure
A standard postmortem has these sections, in order. You can rearrange or rename slightly, but every section should appear.
1. Summary (one paragraph)
What broke, when, who was affected, how long. Three to five sentences. This is the only part 80% of readers will ever read.
Example:
On April 28 from 14:32 to 15:18 PDT (46 minutes), checkout was unavailable for users in the EU region due to a misconfigured database connection pool. Users could browse but could not complete purchases. Estimated revenue impact: $12,000. The issue was resolved by reverting the config change.
2. Impact
Quantified consequences. Numbers, not adjectives.
- How many users affected (count or percentage)
- Duration of degradation
- Money lost (revenue, refunds, SLA penalties) if applicable
- Operational toll (engineers paged, hours spent responding)
- External communication needed (status page, customer emails)
3. Timeline
The minute-by-minute account of the incident. Use UTC or a single time zone consistently. Each event is one line.
14:32 - Config change deployed (commit abc123)
14:34 - Error rate begins climbing in EU region (Datadog dashboard)
14:36 - PagerDuty alert fires for backend-prod-eu
14:38 - On-call (Sarah) acknowledges page, starts investigation
14:42 - Sarah identifies elevated DB connection errors
14:47 - Sarah escalates to DB team in #incident-response
14:51 - DB team identifies the config change as the suspected cause
14:55 - Decision to revert the change (incident commander: Sarah)
14:58 - Revert deployed (commit def456)
15:08 - Error rate returns to baseline
15:18 - Status page updated to "all clear"
The timeline is the spine of the document. Most insights come from looking at the gaps: "why did it take 4 minutes from deploy to alert? Should the threshold be tighter?"
4. Root cause
The technical explanation. What actually broke. Be specific. Show code, config, or queries where helpful.
Example:
A config change to db.poolSize was deployed at 14:32, reducing the connection pool from 50 to 5. This was a typo (intended value was 50, but a 0 was dropped). The application began rejecting requests within 90 seconds as the small pool exhausted under normal load.
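Numbers can do the same work as code here. A quick back-of-the-envelope check, sketched in Python below, shows why a pool of 5 fails under load that a pool of 50 absorbs easily; the request rate and per-request DB hold time are illustrative assumptions, not figures from this incident.

```python
# Rough illustration of the root cause. The load figures are hypothetical;
# only the pool sizes (50 intended, 5 deployed) come from the example above.
POOL_SIZE_INTENDED = 50
POOL_SIZE_DEPLOYED = 5      # the typo: trailing zero dropped

REQUESTS_PER_SECOND = 400   # assumed steady checkout traffic
DB_HOLD_SECONDS = 0.05      # assumed time each request holds a connection

# Little's law: average connections in use = arrival rate * hold time.
connections_in_use = REQUESTS_PER_SECOND * DB_HOLD_SECONDS  # 20.0

print(f"connections in use under normal load: {connections_in_use:.0f}")
print(f"fits intended pool ({POOL_SIZE_INTENDED}): {connections_in_use <= POOL_SIZE_INTENDED}")
print(f"fits deployed pool ({POOL_SIZE_DEPLOYED}): {connections_in_use <= POOL_SIZE_DEPLOYED}")
```

Twenty connections in steady use is comfortable against a pool of 50 and hopeless against a pool of 5, which is why the example describes requests being rejected within 90 seconds of the deploy rather than only during a traffic spike.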
5. Contributing factors
The conditions that allowed the root cause to result in an incident. This section is where most learning happens.
Each factor is a system gap that, if absent, would have prevented or shortened the incident. Phrase them as observations, not accusations.
- The deploy pipeline didn't validate config values against expected ranges.
- The metric for "DB connection pool saturation" existed but had no alert attached.
- The on-call runbook didn't include "check recent config changes" in the first three steps.
- Config changes deploy without staging environment validation.
You're naming the holes in the Swiss cheese, not who lined them up.
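A side note on the second factor: "connection pool saturation" is just the fraction of the pool currently in use, so the missing alert is cheap to add once someone decides to. A minimal sketch in Python; the function names are hypothetical, and wiring it into your monitoring stack depends entirely on your tooling.

```python
# Hypothetical saturation check behind the missing alert. The 80% threshold
# matches action item 2 below; fetching in_use and pool_size is left to
# whatever metrics source you actually have.
ALERT_THRESHOLD = 0.80

def pool_saturation(in_use: int, pool_size: int) -> float:
    """Fraction of the connection pool currently checked out."""
    return in_use / pool_size

def should_alert(in_use: int, pool_size: int) -> bool:
    return pool_saturation(in_use, pool_size) >= ALERT_THRESHOLD

# With the typo'd pool of 5, even 4 in-flight queries trip the alert:
assert should_alert(in_use=4, pool_size=5)       # 0.80 >= 0.80
assert not should_alert(in_use=4, pool_size=50)  # 0.08 <  0.80
```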
6. What went well
Often skipped, always worth including. Calls out the parts of the incident response that worked, so the team knows what to keep doing.
- The on-call acknowledged the page within 4 minutes of firing.
- Cross-team escalation to the DB team happened quickly via the established #incident-response channel.
- The status page was updated in real time.
- The revert path was tested and worked on the first try.
7. Action items
The most important section. A short list of specific, owned, scheduled changes that come out of this incident.
Each action item must be:
- Specific: "Add a CI lint rule for db.poolSize range," not "improve config validation"
- Owned: a name attached to it
- Scheduled: a target date or sprint
- Verifiable: someone can check whether it was done
Example:
| # | Action item | Owner | Due |
|----|--------------------------------------------------------|----------|-----------|
| 1 | Add CI lint rule: db.poolSize must be in range 10-200 | Sarah | May 5 |
| 2 | Add alert on connection pool saturation > 80% | DB team | May 12 |
| 3 | Add "check recent config deploys" to oncall runbook | Sarah | May 5 |
| 4 | Require staging deploy + 10min soak for config changes | Platform | May 20 |
Three to five action items is right. Twenty action items is a wishlist that won't get done. Pick the highest-leverage few.
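To make "specific" tangible: action item 1 can be a small script that CI runs on every config change and that fails the build when the value is out of range. A minimal sketch in Python, assuming the config lives in a YAML file at config/db.yaml; the path, file format, and range are stand-ins for whatever your deploy pipeline actually uses.

```python
#!/usr/bin/env python3
# Hypothetical CI lint for action item 1: fail the build if db.poolSize is
# outside the allowed range. Path, format, and range are illustrative.
import sys
import yaml  # PyYAML

CONFIG_PATH = "config/db.yaml"
MIN_POOL_SIZE = 10
MAX_POOL_SIZE = 200

def main() -> int:
    with open(CONFIG_PATH) as f:
        config = yaml.safe_load(f)

    pool_size = config.get("db", {}).get("poolSize")
    if not isinstance(pool_size, int):
        print(f"{CONFIG_PATH}: db.poolSize is missing or not an integer")
        return 1
    if not MIN_POOL_SIZE <= pool_size <= MAX_POOL_SIZE:
        print(
            f"{CONFIG_PATH}: db.poolSize={pool_size} is outside the "
            f"allowed range {MIN_POOL_SIZE}-{MAX_POOL_SIZE}"
        )
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

A check like this would have caught the 50-to-5 typo at review time instead of in production, which is exactly the kind of system fix, rather than people fix, that the blameless framing asks for.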
8. Lessons learned (optional)
If there's a broader lesson worth surfacing beyond the action items, name it. "We don't currently treat config as code with the same rigor as application code" is a lesson. The action items above are concrete steps; the lesson is the pattern that should change.
The tone that works
Write the postmortem like you're explaining to a thoughtful new engineer who joins the team next month. They weren't there. They have no context. They want to learn. Three rules:
1. Use passive voice for actions, active voice for fixes
Passive voice is usually bad writing, except in postmortems where it depersonalizes the cause. "A config change was deployed" is appropriate. "Sarah deployed a config change" shifts the tone toward blame even if not intended.
For the fixes and action items, active voice is fine because the agency is the team's own.
2. Distinguish facts from interpretations
Fact: "Error rate climbed from 0.1% to 47% over 90 seconds."
Interpretation: "This caused user frustration."
Both can appear in a postmortem, but mark which is which. Reviewers can argue interpretations; they can't argue facts. The split forces clarity.
3. Don't speculate
If you don't know, say so. "It's unclear why the alert took 4 minutes to fire; this is being investigated separately." Speculation about causes you can't confirm makes the document less trustworthy and lowers the impact of the parts you can confirm.
Practice the workflow that keeps things from breaking
InternQuest's missions train the muscle of small, careful changes that survive code review: conventional commits, focused scope, automated PR feedback. The free virtual SWE internship simulator. Because the best postmortem is the one you didn't have to write.
Try a mission →
Common postmortem failure modes
Naming the engineer in the timeline
Even with blameless intent, you'll often see "Alex deployed the change at 14:32." This implicitly blames Alex even when nobody intends to. Name roles instead: "the on-call engineer," "the deployer," "the DB team."
Exception: when crediting a quick action ("Sarah identified the cause within 4 minutes"), naming is fine. The asymmetry is intentional: people get credit, systems get blamed.
Action items that aren't actions
"Be more careful with config changes." Not an action item. Nothing testable, nothing scheduled. The fix should be a system change: a lint rule, an alert, a process change.
Multiple "root causes"
You usually have one technical root cause and several contributing factors. Lumping them together as "root causes" muddies the analysis. Pick the one thing that, if absent, would have prevented the incident, and call it the root cause. Everything else goes in contributing factors.
The five whys gone wild
The "five whys" technique (ask "why?" five times to drill into root cause) is useful but easy to misuse. By the fifth why you can usually reach "because the company exists" or some equally unactionable abstraction. Stop when you hit something the team can actually change.
Postmortem-as-performance-review
If you're using a postmortem to make the case that someone should be fired or formally reprimanded, you're using the wrong document. That's an HR conversation that happens separately, with HR. The postmortem is for systems learning, done in the open with the whole team.
No follow-through
The single biggest postmortem failure is writing one and then never doing the action items. Track them in your team's backlog with deadlines. Review at the next retro. If your team writes 10 postmortems a year and completes 30% of action items, you're roughly tied with most of the industry, and improving that number is the cheapest reliability investment you can make.
An intern's role in postmortems
You probably won't lead one. You might attend the postmortem meeting, where the team reviews the draft together. Useful things to do as an intern:
- Read past postmortems for your team. Most teams keep a folder. Reading 5 to 10 of them is the fastest way to learn how the system actually fails and what the team has historically struggled with.
- Ask clarifying questions. The new-engineer perspective often catches assumed knowledge that should be made explicit.
- If you contributed to an incident, say so honestly. "I deployed the config change without realizing it skipped staging" is exactly the kind of factual contribution that makes the document better. The team's response to your honesty (in a healthy team) is supportive, not punitive.
- Volunteer to write one. If a small incident happens during your internship and your manager is open to it, offer to write the postmortem. It's a high-visibility document and the practice is invaluable.
The minimum viable postmortem
If your team doesn't have a template, the absolute minimum is:
- What happened (one paragraph)
- Timeline (5 to 15 lines)
- Root cause (one paragraph with technical detail)
- Action items (3 to 5, owned, dated)
Even that bare-bones version is enormously valuable if a team had nothing before. Don't let perfect be the enemy of "we wrote it down."
Where to publish them
Internal-only is the default. Some teams maintain a public postmortem culture (Stripe, GitLab, and Cloudflare have famous ones). The choice depends on the company's openness norms; don't make that call unilaterally as an intern.
Internal postmortems should be readable across the company, not just the team. The whole point is that another team running into the same class of issue can find the lesson.
The shift in mindset
An incident isn't a failure of an engineer. It's the team learning something the system didn't already know. The postmortem is the artifact of that learning. Done well, it makes the next year better. Done badly, it just produces a document nobody reads and a quietly traumatized engineer.
Write them blamelessly. Make them concrete. Track the action items. Repeat. The teams that get this right tend to be the same teams that ship reliably and grow engineers fast. Both come from the same root: treating mistakes as data instead of failure.
Build the workflow muscle that prevents most incidents
The best postmortems are the ones never needed because conventional commits, careful PRs, and code review caught the issue first. InternQuest is a free virtual software engineering internship simulator that drills exactly that loop on real broken code.
Try a mission →