1001Ferramentas
Template Runbook

Generates an operational runbook template (alerts, diagnosis, mitigation).

Runbook: the operations playbook every on-call needs

A runbook is a step-by-step operational document that tells an engineer exactly what to do when a known scenario happens: a deploy, a restart, a secret rotation, a disaster recovery, an alert that just fired at 03:00. The discipline of writing runbooks was popularised by Google's Site Reliability Engineering book (O'Reilly, 2016), which calls them playbooks and made the case that any service in production must come with one before it goes on-call. The economic argument is simple: a well-written runbook can cut MTTR (mean time to recovery) by 50% or more, because the responder does not have to discover the procedure during the incident. The SRE book itself puts the gain from a prepared playbook at roughly 3x over improvising.

A solid runbook has predictable sections: When to use (the trigger — an alert ID, a customer ticket, a scheduled window), Prerequisites (required access, tools, VPN, kubectl context), Steps (numbered, copy-pasteable commands), Verification (how do I know it worked?), Rollback (how do I undo it if it didn't?), Escalation (who do I page after N minutes?) and Related runbooks. Format matters: most teams keep runbooks as Markdown in a git repo (versioned, reviewable in PRs), and mirror or render them into Confluence or Notion for searchability.
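The section list above maps naturally onto a small template generator. The following is a minimal sketch, not the code behind this tool: the `Runbook` class, the `render_markdown` function and the checkout-service example are all hypothetical names chosen for illustration. It renders every section in the order listed, numbers the Steps, and marks unfilled sections as TODO so gaps are visible in review.

```python
from dataclasses import dataclass, field

# Section names and order follow the paragraph above.
SECTIONS = [
    "When to use",
    "Prerequisites",
    "Steps",
    "Verification",
    "Rollback",
    "Escalation",
    "Related runbooks",
]


@dataclass
class Runbook:
    """Hypothetical container: a title plus lines of content per section."""
    title: str
    content: dict[str, list[str]] = field(default_factory=dict)


def render_markdown(runbook: Runbook) -> str:
    """Render the runbook as Markdown, one heading per section."""
    lines = [f"# Runbook: {runbook.title}", ""]
    for section in SECTIONS:
        lines.append(f"## {section}")
        lines.append("")
        items = runbook.content.get(section, [])
        if not items:
            # Keep empty sections visible instead of silently dropping them.
            lines.append("_TODO_")
        elif section == "Steps":
            # Steps are numbered so responders can say "I am at step 3".
            lines.extend(f"{i}. {item}" for i, item in enumerate(items, start=1))
        else:
            lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines)


# Hypothetical example: service name and alert ID are made up.
example = Runbook(
    title="Restart the checkout service",
    content={
        "When to use": ["Alert `checkout-5xx-rate` fired"],
        "Steps": [
            "kubectl rollout restart deployment/checkout",
            "kubectl rollout status deployment/checkout",
        ],
    },
)
print(render_markdown(example))
```

Because the output is plain Markdown, it can be committed to the service's repository and reviewed in a pull request like any other change.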

Categories of runbook

Most teams converge on three families: incident response (triggered by an alert; PagerDuty or Opsgenie links the alert directly to the runbook URL), scheduled (cron-style — certificate rotation, monthly DB vacuum, backup verification) and maintenance windows (deploys, major upgrades, scheduled migrations). Each family has different tolerance for ambiguity — incident-response runbooks must be runnable in a panic; scheduled ones can afford to be longer and more explanatory.

Tooling, automation and "runbook as code"

Modern stacks tie runbooks to alerts and to executable code. PagerDuty and Opsgenie let you attach a runbook URL to every alert. Jira Service Management, ServiceNow and Backstage (Spotify's open-source developer portal) host runbooks alongside service catalogs. The natural evolution is "runbook as code": idempotent scripts that humans can read but that machines can also execute — Ansible playbooks, Terraform modules, AWS Systems Manager Run Command, Rundeck. The Markdown runbook becomes the documentation around an automated procedure rather than a sequence of manual steps.
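What "idempotent" means in practice can be shown with a small sketch. It is not how Ansible, Terraform or Rundeck are implemented; the `Step` class and `run` function are hypothetical, and the in-memory `state` dictionary stands in for a real system. Each step carries a check for the desired state, so running the procedure a second time changes nothing, and the same check doubles as the verification after the action.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """Hypothetical runbook step: a readable name, a state check, an action."""
    name: str
    is_done: Callable[[], bool]  # True when the desired state already holds
    apply: Callable[[], None]    # the action that brings the state about


def run(steps: list[Step]) -> list[str]:
    """Run steps in order, skipping any whose desired state already holds."""
    log = []
    for step in steps:
        if step.is_done():
            log.append(f"skip: {step.name}")
            continue
        step.apply()
        # Re-check after acting: "command ran" is not "state is correct".
        if not step.is_done():
            raise RuntimeError(f"verification failed: {step.name}")
        log.append(f"applied: {step.name}")
    return log


# Stand-in for a real system; in practice the check would query an API.
state = {"replicas": 1}


def scale_up() -> None:
    state["replicas"] = 3


steps = [Step("scale to 3 replicas", lambda: state["replicas"] == 3, scale_up)]

first = run(steps)   # performs the change
second = run(steps)  # finds nothing to do
print(first, second)
```

The log that `run` returns reads like the manual steps would, which is the point: the same procedure stays legible to a human and executable by a machine.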

Anti-patterns to avoid

The four classic failures of runbook hygiene: (1) outdated content — "last reviewed 2019" is a red flag; teams should review runbooks at least quarterly, ideally via a Chaos Engineering exercise that actually runs them; (2) verbose preamble — by step 4 the on-call has stopped reading; put the commands up front; (3) missing rollback — every destructive step needs an explicit undo; (4) no verification step — "deploy succeeded" is not the same as "the service is healthy". The Google SRE book and the NSA Cybersecurity Operations runbooks are good public references for shape and tone.
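Three of these failures can be caught mechanically, for example in CI on the runbook repository. The sketch below is one possible check under stated assumptions: it expects a `Last reviewed: YYYY-MM-DD` line and Markdown headings named "Rollback" and "Verification", conventions that are assumed here rather than standard, and it treats "quarterly" as 90 days. Verbose preamble is a judgment call and is left to human review.

```python
import re
from datetime import date, timedelta

REQUIRED_SECTIONS = ("Rollback", "Verification")
MAX_AGE = timedelta(days=90)  # assumed reading of "at least quarterly"


def lint_runbook(markdown: str, today: date) -> list[str]:
    """Return a list of hygiene problems found in a Markdown runbook."""
    problems = []

    # Collect heading texts ("## Rollback" -> "rollback").
    headings = {
        m.group(1).strip().lower()
        for m in re.finditer(r"^#+[ \t]+(.+)$", markdown, re.MULTILINE)
    }
    for section in REQUIRED_SECTIONS:
        if section.lower() not in headings:
            problems.append(f"missing section: {section}")

    # Assumed convention: a line of the form "Last reviewed: 2024-01-15".
    match = re.search(
        r"^Last reviewed:[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$", markdown, re.MULTILINE
    )
    if match is None:
        problems.append("no review date")
    elif today - date.fromisoformat(match.group(1)) > MAX_AGE:
        problems.append(f"stale: last reviewed {match.group(1)}")

    return problems


# Hypothetical runbook with a 2019 review date and no Rollback section.
sample = """# Runbook: Rotate API secret
Last reviewed: 2019-05-01

## Steps
1. (commands go here)

## Verification
- (health check goes here)
"""

problems = lint_runbook(sample, today=date(2024, 1, 15))
print(problems)
```

A check like this only proves that the sections exist and were touched recently; whether the steps still work is something only running the runbook can show.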

FAQ

Does a runbook replace documentation? No — it complements it. Documentation explains what the system is and why it works that way. A runbook tells you what to do in a specific operational situation.

Is a runbook mandatory? For anything that goes on-call, yes. Most mature SRE teams refuse to take pager duty for a service that does not have runbooks for its top alerts.

Where should runbooks live? The pragmatic answer is "in two places": as Markdown in the service's git repo (versioned, code-reviewed), and rendered or copied into Confluence/Notion (searchable from the incident channel).

How do I know a runbook still works? Run it. Game days and Chaos Engineering exercises (Netflix's Chaos Monkey lineage) exist precisely to validate that runbooks are still accurate against the current production system.
