Postmortem templates are easy to find online. Most of them are wrong — or at least incomplete. Here's the template we use after every incident, and why each section matters.
The five sections
1. Summary. What happened, in two sentences a non-technical executive can understand. Skip the jargon. "From 2:14pm to 3:47pm UTC, customers couldn't log in. Cause: an expired certificate that wasn't auto-renewed."
2. Timeline. Every event in chronological order with timestamps in UTC. Detection, escalation, mitigation attempts, resolution. Be honest about what didn't work.
3. Root cause. Singular. The actual underlying issue. Not "the deploy" — why did the deploy fail? Not "human error" — what process let the human error reach production? Keep asking why until you can't go deeper.
4. Contributing factors. Plural. Things that made the incident worse, longer, or harder to detect. Stale runbooks, missing alerts, on-call confusion. These often matter more than the root cause for preventing repeats.
5. Action items. Each one has an owner, a due date, and a measurable outcome. "Improve monitoring" is not an action item. "Add an alert that fires 30 days before cert expiry, owned by S. Patel, due May 15" is.
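Two of these sections can be checked mechanically: the timeline (UTC, chronological) and the action items (owner, due date, measurable outcome). Here is a minimal Python sketch of how such checks might look. The class names, field names, and sample values are our own illustration, not part of any tool, and the year on the due date is made up because the example above doesn't give one.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class TimelineEvent:
    at: datetime  # must carry a timezone
    note: str


@dataclass
class ActionItem:
    description: str
    owner: str
    due: Optional[date]
    measure: str  # how we will know the item is done


def normalize_timeline(events: List[TimelineEvent]) -> List[TimelineEvent]:
    """Convert every timestamp to UTC and return the events oldest first."""
    for e in events:
        if e.at.utcoffset() is None:
            raise ValueError(f"timestamp without a timezone: {e.note!r}")
    in_utc = [TimelineEvent(e.at.astimezone(timezone.utc), e.note) for e in events]
    return sorted(in_utc, key=lambda e: e.at)


def action_item_problems(item: ActionItem) -> List[str]:
    """Return the reasons an item is not yet actionable (empty list if it is)."""
    problems = []
    if not item.owner.strip():
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    if not item.measure.strip():
        problems.append("no measurable outcome")
    return problems


# Hypothetical events recorded in two different timezones.
utc_plus_one = timezone(timedelta(hours=1))
events = [
    TimelineEvent(datetime(2024, 3, 4, 16, 5, tzinfo=utc_plus_one), "rollback started"),
    TimelineEvent(datetime(2024, 3, 4, 14, 50, tzinfo=timezone.utc), "alert fired"),
]
for e in normalize_timeline(events):
    print(e.at.isoformat(), e.note)

vague = ActionItem("Improve monitoring", owner="", due=None, measure="")
print(action_item_problems(vague))
# ['no owner', 'no due date', 'no measurable outcome']

# The year and the measure text are invented for illustration.
concrete = ActionItem(
    "Add alert on cert expiry",
    owner="S. Patel",
    due=date(2024, 5, 15),
    measure="alert fires in a test with a near-expiry cert",
)
print(action_item_problems(concrete))
# []
```

A check like this can't tell you whether an action item is the right one. It only catches the items that were never going to get done.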
The section most teams forget
Customer impact. Specifically: how many customers, for how long, what could they do and not do, and have we communicated with them honestly?
This isn't a legal disclosure section; it's a discipline. If you write down what your customers experienced, you'll feel differently about how seriously to treat the incident. We've watched teams quietly raise the severity after writing the customer impact paragraph and realizing how bad it actually was.
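The "how many, for how long" part is plain arithmetic, and it helps to compute it rather than estimate it. Here is a small Python sketch using the outage window from the summary example (2:14pm to 3:47pm, treated here as UTC). The function name, the date, and the customer counts are hypothetical, chosen only to show the calculation.

```python
from datetime import datetime, timezone


def impact_summary(start: datetime, end: datetime, affected: int, total: int) -> str:
    """Summarize how many customers were affected and for how long."""
    if end <= start:
        raise ValueError("end must be after start")
    if total <= 0 or not 0 <= affected <= total:
        raise ValueError("need 0 <= affected <= total and total > 0")
    minutes = int((end - start).total_seconds() // 60)
    share = 100 * affected / total
    return f"{affected} of {total} customers ({share:.1f}%) affected for {minutes} minutes"


# Hypothetical date and customer counts; the times come from the summary example.
start = datetime(2024, 3, 4, 14, 14, tzinfo=timezone.utc)
end = datetime(2024, 3, 4, 15, 47, tzinfo=timezone.utc)
print(impact_summary(start, end, affected=1200, total=4800))
# 1200 of 4800 customers (25.0%) affected for 93 minutes
```

The numbers are only the start of the paragraph. What customers could and could not do, and what we told them, still has to be written by a person.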
What we don't include
Blame. Names appear only as owners of action items, never as causes. The point of a postmortem is to fix the system, not punish the person who tripped the wire the system left exposed.
Speculation. "Maybe we should consider X" goes in a separate doc. The postmortem is for what happened and what we're doing about it. Concretely.
How soon it should happen
Write the postmortem within 48 hours of the incident closing, even for sev-3s. Memory degrades fast. We put the postmortem on the calendar before the incident is even closed.