
Postmortem Template

Generates a blameless postmortem template (summary, timeline, root cause, action items).


Blameless postmortem: from aviation safety to software engineering

A postmortem (often called a blameless post-mortem) is a written analysis of an incident after it has been resolved: what happened, what was the impact, why it happened, how the team responded, and what will change so it does not happen again. The format was borrowed from aviation safety culture — the NTSB (National Transportation Safety Board) has been publishing detailed accident reports for decades — and adapted to software by Google in the SRE book chapter "Postmortem Culture: Learning from Failure". The defining adjective is blameless: the goal is to find the systemic cause, not to assign blame to an individual.

A canonical postmortem contains: Summary (one paragraph), Impact (users affected, duration, revenue, SLO burn), Timeline (timestamped events from detection to recovery), Root cause (often arrived at via the 5 Whys), Contributing factors, Detection (how did we find out?), Response, Recovery, Lessons learned and Action items (each with an explicit owner and a due date). The 5 Whys technique originated at Toyota, where it is usually attributed to Sakichi Toyoda and was popularized by Taiichi Ohno as part of the Toyota Production System: keep asking "why" until you stop hitting symptoms and start hitting structure.
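The section list above maps naturally onto a small generator. The following is a minimal sketch in Python, not the implementation behind this tool: the names `ActionItem`, `SECTIONS` and `render_postmortem` are hypothetical, and the Markdown layout is only one reasonable choice.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """One follow-up task; owner and due date are mandatory by construction."""
    description: str
    owner: str
    due: date


# Section names follow the canonical list in the text (Action items is rendered separately).
SECTIONS = [
    "Summary", "Impact", "Timeline", "Root cause", "Contributing factors",
    "Detection", "Response", "Recovery", "Lessons learned",
]


def render_postmortem(title: str, severity: str, action_items: list[ActionItem]) -> str:
    """Return an empty Markdown postmortem skeleton with a filled action-item table."""
    lines = [f"# Postmortem: {title}", "", f"Severity: {severity}", ""]
    for section in SECTIONS:
        # Each section starts as a placeholder for the author to fill in.
        lines += [f"## {section}", "", "_TODO_", ""]
    lines += ["## Action items", "", "| Action | Owner | Due |", "| --- | --- | --- |"]
    for item in action_items:
        lines.append(f"| {item.description} | {item.owner} | {item.due.isoformat()} |")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    items = [ActionItem("Add canary stage to deploy pipeline", "alice", date(2024, 5, 1))]
    print(render_postmortem("API outage", "SEV1", items))
```

Making `owner` and `due` required fields means an action item without them cannot be constructed at all, which mirrors the rule that every action item needs an explicit owner and date.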

Why "blameless" matters

Blamelessness is not about being polite — it is an engineering investment in psychological safety. When engineers know they will not be punished for honestly describing what happened, they tell the truth, and the team learns. The opposite — a culture that blames "human error" — produces postmortems where everyone hides the interesting details, and the same incident recurs six months later. The right reframing is "what allowed the human error?": missing guardrails, ambiguous tooling, alerts that fired too late, runbooks that were wrong. The combination "blameless plus accountable" is sometimes called a just culture.

Severities, metrics and famous postmortems

Most teams classify incidents by severity, for example SEV1 (critical, all hands on deck), SEV2 (major) and SEV3 (minor), and reserve mandatory written postmortems for SEV1 and SEV2. Key metrics include MTTD (Mean Time To Detect), MTTR (Mean Time To Recovery) and MTTF (Mean Time To Failure). Publicly published postmortems worth reading: AWS S3 (February 2017, a typo in a command), Cloudflare (July 2019, a bad regex deployed globally), GitHub (October 2018, a network partition followed by a database failover that left data inconsistent across sites), Slack (January 2021), Fastly (June 2021, a single customer configuration change triggered a latent software bug and a global outage).
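As an illustration of how MTTD and MTTR fall out of incident timelines, here is a small Python sketch. It assumes both metrics are measured from the start of the incident (some teams measure MTTR from detection instead), and the record fields `started`, `detected` and `resolved` are hypothetical names, as is the sample data.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(intervals):
    """Average a collection of timedeltas and return the result in minutes."""
    return mean(delta.total_seconds() for delta in intervals) / 60


def incident_metrics(incidents):
    """Return (MTTD, MTTR) in minutes, both measured from incident start (an assumption)."""
    mttd = mean_minutes(i["detected"] - i["started"] for i in incidents)
    mttr = mean_minutes(i["resolved"] - i["started"] for i in incidents)
    return mttd, mttr


if __name__ == "__main__":
    # Made-up sample incidents, for illustration only.
    sample = [
        {"started": datetime(2024, 1, 10, 9, 0),
         "detected": datetime(2024, 1, 10, 9, 12),
         "resolved": datetime(2024, 1, 10, 10, 0)},
        {"started": datetime(2024, 2, 3, 14, 0),
         "detected": datetime(2024, 2, 3, 14, 4),
         "resolved": datetime(2024, 2, 3, 14, 30)},
    ]
    mttd, mttr = incident_metrics(sample)
    print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```

Whichever convention a team picks, the postmortem timeline should record all three timestamps so the numbers can be recomputed later.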

Tools and action-item discipline

Modern incident-management platforms such as Atlassian Statuspage, FireHydrant, incident.io, Jeli (incident analytics, now part of PagerDuty) and Rootly build the postmortem template into the incident lifecycle. Action items should be SMART (Specific, Measurable, Achievable, Relevant, Time-bound) and each one must have a named owner and a date. The dirty secret of postmortems is that the document itself is worth very little; what changes the future is whether the action items are actually completed. A weekly or monthly review meeting that walks through open action items is the difference between a learning culture and a postmortem ritual.
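The review meeting described above needs little more than a list of open items sorted by due date. The sketch below shows one way to produce it in Python; `TrackedAction` and `overdue` are hypothetical names, and a real team would more likely pull this data from its issue tracker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TrackedAction:
    """A postmortem action item with the owner and due date the text requires."""
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Return open items whose due date has passed, oldest first."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)


if __name__ == "__main__":
    # Made-up backlog, for illustration only.
    backlog = [
        TrackedAction("Add alert on replication lag", "bob", date(2024, 3, 1)),
        TrackedAction("Fix wrong step in failover runbook", "carol", date(2024, 2, 15)),
        TrackedAction("Add canary stage to deploys", "alice", date(2024, 2, 1), done=True),
        TrackedAction("Rate-limit config pushes", "dave", date(2024, 6, 1)),
    ]
    for item in overdue(backlog, today=date(2024, 4, 1)):
        print(f"{item.due.isoformat()}  {item.owner:<6} {item.description}")
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the check easy to test and to rerun for a past review date.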

FAQ

Should postmortems be public or internal? Both, usually. The internal version is longer and more detailed (timestamps, customer names, internal tool names); the external version is a customer-facing communication on the status page. AWS, Cloudflare and GitHub all publish external versions for major incidents.

Is a postmortem mandatory? For SEV1 and most SEV2, yes. Treat the "should we write one?" decision as a default-yes; the cost of writing is low compared to the cost of repeating the incident.

When should the postmortem be written? Within one week of the incident, while memory is still fresh. Drafting often starts during the incident itself (timeline, key decisions) and is finalised after a structured retrospective.

Is "human error" ever a valid root cause? No. If a human action triggered the incident, the real root cause is the system that allowed a single human action to have that effect. Reframe every "human error" answer as "what was missing that would have prevented this?".
