The most dangerous person on your on-call rotation is your most experienced engineer — not because they're unsafe, but because so much operational knowledge lives exclusively in their head. When every runbook gap is filled by "just ask Sarah," your platform's operational knowledge is a single point of failure. Claude Cowork for runbook generation is how you systematically extract that knowledge before the next incident, the next offboarding, or the next midnight page that Sarah misses.
This article is part of the Claude Cowork for DevOps and platform engineers series. It covers the specific runbook generation workflow — what inputs to feed Cowork, which prompts to use, and how to validate the output before it goes into production. The companion articles cover incident post-mortems, infrastructure documentation, and DevOps automation workflows.
What Makes a Good Runbook (and Why Most Aren't)
A runbook is useful at 3am when an engineer who's been asleep for 3 hours is trying to resolve a P1 without escalating. That's the test. By that standard, most runbooks fail: they're too high-level, they assume context the on-call doesn't have, they're outdated, or they simply don't exist.
The specific failures that Cowork-generated runbooks address:
- Too high-level: Cowork reads your actual scripts and generates step-by-step commands, not "restart the service" but the exact command with the correct flags and the expected output.
- Missing failure modes: By reading post-mortem history, Cowork populates the "common failure modes" section with actual failure patterns your team has experienced, not theoretical ones.
- Outdated: Runbooks generated from current scripts and configs reflect the current state. The runbook review workflow (described below) catches drift.
- Don't exist: The generation workflow reduces runbook creation from 4–8 hours to 45–60 minutes per service, making it feasible to cover the backlog.
The Cowork Runbook Extraction Workflow (5 Steps)
Step 1: Collect the source materials
For each service, collect: deployment scripts (deploy.sh, Makefile, CI/CD pipeline config), monitoring query files or dashboard exports, existing architecture notes or README files, the last 5 post-mortems for the service (if available), and any Slack threads where someone explained how the service works. Drop all of this into a Cowork canvas dedicated to this service.
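Before running the generation prompt, it helps to confirm nothing obvious is missing. This is a minimal sketch of that pre-flight check, assuming a per-service directory layout; the category names and candidate paths are illustrative, not a Cowork feature:

```python
from pathlib import Path

# Source-material categories the generation prompt expects, per service.
# The candidate file paths are assumptions -- adapt to your repo layout.
EXPECTED = {
    "deployment scripts": ["deploy.sh", "Makefile"],
    "CI/CD config": [".github/workflows/deploy.yml", ".gitlab-ci.yml"],
    "architecture notes": ["README.md", "docs/architecture.md"],
}

def missing_materials(service_dir: str) -> list[str]:
    """Report which source-material categories have no file present."""
    root = Path(service_dir)
    missing = []
    for category, candidates in EXPECTED.items():
        if not any((root / c).exists() for c in candidates):
            missing.append(category)
    return missing
```

Anything the check reports as missing becomes a question for the service owner rather than a silent gap in the generated runbook.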
Step 2: Run the primary generation prompt
Use the structured prompt below. Cowork reads all the source materials simultaneously — this is the canvas advantage. It cross-references the deployment scripts against the post-mortem failures and produces a runbook that covers what you know happened, not just what you thought might happen.
Step 3: Identify and fill the gaps
Cowork's output includes a section titled "Gaps Found" — places where it inferred information or found source materials that contradicted each other. These are the sections that need review by the service owner. Budget 20–30 minutes with the relevant engineer to fill these in. This is faster than starting from scratch and far more structured than a free-form knowledge transfer.
Step 4: Validate with the on-call rotation
Before the runbook goes live, run the gap analysis prompt (below) to compare it against your last 5 incidents. Ask Cowork: "Does this runbook cover what actually went wrong in each of these incidents?" If the answer is no for any incident, you have a specific gap to fill.
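The coverage check in this step reduces to a simple set comparison. A hypothetical sketch, assuming each incident has been tagged with a root-cause label (in practice Cowork does this matching from the post-mortem prose, not from tags):

```python
def runbook_gaps(runbook_modes: set[str], incidents: dict[str, str]) -> list[str]:
    """Return incident IDs whose root cause is not covered by the runbook.

    incidents maps incident ID -> root-cause tag. Tags are illustrative.
    """
    return [iid for iid, cause in incidents.items() if cause not in runbook_modes]

covered = {"disk-full", "connection-pool-exhaustion"}
incidents = {"INC-2041": "disk-full", "INC-2087": "certificate-expiry"}
gaps = runbook_gaps(covered, incidents)  # INC-2087 is the uncovered incident
```

Each ID in `gaps` is a specific failure mode to add before publishing.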
Step 5: Publish and schedule review
Publish to Confluence via the connector. Set a review date — quarterly for active services. When infrastructure changes deploy, use the Cowork infrastructure documentation workflow to update the runbook automatically. The runbook is now a living document, not a one-time artefact.
Runbook Generation Prompt Templates
Prompt 1: Primary runbook generation

Generate an operational runbook for [SERVICE NAME]. Source materials provided:
- Deployment scripts [attached: deploy.sh, rollback.sh]
- CI/CD pipeline config [attached]
- Architecture notes / README [attached]
- Recent post-mortems [attached: list]
- Monitoring dashboard export [attached if available]

Generate a complete runbook with these sections:

1. SERVICE OVERVIEW
- What it does (2 paragraphs: business purpose + technical role)
- SLOs: availability target, latency target, error budget
- Key dependencies (upstream and downstream services)
- Owned by: [team/team lead]

2. OPERATIONAL PROCEDURES
For each procedure, include: pre-conditions, exact commands with flags, expected output, what to check if the command fails.
- Deployment procedure
- Rollback procedure
- Scaling procedure (if applicable)
- Configuration change procedure

3. MONITORING AND ALERTING
- Primary dashboard URL
- Key metrics to watch (name, normal range, alert threshold)
- Log locations (with example query for common searches)
- Alert definitions: what each alert means and first response

4. COMMON FAILURE MODES
Use the post-mortems to identify what has actually gone wrong, not just what could go wrong. For each failure mode:
- Symptoms (what the on-call sees)
- Likely root causes
- Diagnostic steps (with specific commands)
- Resolution steps
- Escalation trigger (when to escalate vs. keep investigating)

5. ESCALATION PATH
- First escalation: [role/person] when [condition]
- Second escalation: [role/person] when [condition]
- Vendor escalation: [if third-party dependencies]

6. DISASTER RECOVERY
- Data backup location and restoration procedure
- Worst-case recovery steps (service completely down, data potentially lost)

GAPS FOUND: At the end, list any sections where you inferred information or found inconsistencies in the source materials. Be specific.
Prompt 2: Gap analysis against incident history

I have a runbook draft for [SERVICE NAME] [attached] and the last 5 post-mortems for this service [attached]. For each post-mortem, answer:
1. Is the failure mode described in the post-mortem covered in the runbook?
2. If yes: are the diagnostic steps accurate based on what the post-mortem says actually worked?
3. If no: what section of the runbook should cover this and what should it say?

Produce a gap analysis table:
| Incident | Covered? | Accurate? | Recommended Addition |

Then produce the specific text additions for each gap — formatted ready to insert into the runbook.
Prompt 3: Infrastructure change update

The infrastructure for [SERVICE NAME] changed with this Terraform apply [attach plan output] deployed on [DATE]. Current runbook [attached]. Update the runbook for:
1. Any changed commands (flags, endpoints, resource names)
2. Any new failure modes introduced by the change
3. Updated monitoring references (if metrics or dashboards changed)
4. Rollback procedure update (reflect the new infrastructure state)

Show me the specific sections to update with the before/after text. Don't rewrite sections that don't change.
Runbook Quality Checklist
Before a Cowork-generated runbook goes into production, run this checklist:
- Every command is copied from the actual scripts, with the correct flags and the expected output, not paraphrased
- All items in the "Gaps Found" section have been resolved with the service owner
- The gap analysis prompt has been run against the last 5 incidents, with no uncovered failure modes remaining
- The escalation path names a current role or person for each condition
- A review date is scheduled (quarterly for active services) and the runbook is published where the on-call rotation looks first
Prioritising the Runbook Backlog
If your team has 40 services with no runbooks and capacity to write 4 per week, you need a prioritisation framework. Use the Cowork runbook gap analysis automation to generate the prioritised list automatically — services ranked by incident frequency multiplied by runbook coverage gap.
The manual version of this prioritisation:
- Services with P1/P2 incidents in the last 90 days and no runbook — these first, always
- Services on the on-call rotation where the primary owner is leaving or moving teams — before they go
- Services that junior engineers regularly escalate to senior engineers for — the knowledge gap is already visible
- New services deployed without documentation — before the first incident
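The ranking heuristic behind this list can be sketched in a few lines. A minimal sketch, assuming you score each service on incident count, runbook coverage, and owner attrition; the weights and service names are illustrative assumptions, not the Cowork automation's actual formula:

```python
def priority_score(incidents_90d: int, has_runbook: bool,
                   owner_leaving: bool = False) -> float:
    """Incident frequency multiplied by coverage gap, with a large
    bump for services about to lose their owner. Weights are assumptions."""
    coverage_gap = 0.2 if has_runbook else 1.0  # an existing runbook mostly closes the gap
    score = incidents_90d * coverage_gap
    if owner_leaving:
        score += 5  # knowledge walks out the door: jump the queue
    return score

# Hypothetical backlog: (incidents in last 90 days, has runbook, owner leaving)
services = {
    "payments-api": (4, False, False),
    "auth-service": (1, True, True),
    "report-worker": (0, False, False),
}
ranked = sorted(services, key=lambda s: priority_score(*services[s]), reverse=True)
```

Here the departing owner pushes auth-service to the top despite its low incident count, which matches the manual rule above: capture the knowledge before it leaves.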
The knowledge transfer session: For complex services where the runbook generation needs expert input, schedule a 30-minute session with the service owner. Record it (with permission). Feed the transcript to Cowork alongside the scripts and configs. The combination of technical artefacts and verbal explanation produces the most complete runbooks.
The platform engineering ROI case for this work is covered in detail in our companion article on Claude Cowork ROI in platform engineering. The short version: teams with complete, current runbooks resolve P1 incidents 40–60% faster than those without, and junior engineers can operate services independently rather than requiring escalation to senior staff.
Frequently Asked Questions
What if the service has no existing documentation at all?
This is the most common case. Cowork can generate a runbook from code alone — it reads deployment scripts, CI/CD configs, Kubernetes manifests, and Terraform files and derives the operational procedures from the infrastructure definition. The resulting runbook is more accurate than a human would write from memory, though it will have more gaps in the "failure modes" section (since those come from experience). Pair the code-derived draft with a 30-minute knowledge transfer session with the service owner for the most complete result.
How do we keep runbooks current after the initial generation?
Three mechanisms: First, the quarterly review — schedule it when the runbook is published and actually do it. Second, the infrastructure change update workflow — whenever a Terraform change deploys, run the runbook update prompt to identify what changed. Third, post-mortem feedback — when an incident reveals a runbook gap, use the gap analysis prompt to update the runbook the same day. Teams that do all three maintain runbooks that are actually useful. The Cowork automation for runbook gap analysis runs quarterly and flags stale runbooks automatically.
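The quarterly staleness check is the mechanically simple part of this. A minimal sketch, assuming you track a last-reviewed date per runbook; the 90-day interval mirrors the quarterly cadence above, and the function is a hypothetical helper, not the Cowork automation itself:

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # quarterly, per the review schedule above

def stale_runbooks(last_reviewed: dict[str, date], today: date) -> list[str]:
    """Flag runbooks whose last review is older than the quarterly interval."""
    return sorted(s for s, d in last_reviewed.items() if today - d > REVIEW_INTERVAL)
```

Anything the check flags gets the gap analysis prompt run against it before the next on-call handoff.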
Can Cowork generate runbooks from Kubernetes manifests and Helm charts?
Yes. Feed Cowork the Kubernetes manifests, the Helm chart values files, and any Helm hooks. Cowork understands Kubernetes resource types and will generate deployment, scaling, and rollback procedures based on the actual manifest configuration. The output includes kubectl commands specific to your namespace and deployment names. If you use ArgoCD or Flux for GitOps, include those configuration files and Cowork will document the deployment process through your actual workflow rather than generic kubectl commands.
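To illustrate what "kubectl commands specific to your namespace and deployment names" means in practice, a generated rollback procedure typically reduces to the standard kubectl rollout subcommands with your names filled in. The helper below is a hypothetical sketch of that output shape, not Cowork's actual output:

```python
def rollback_commands(deployment: str, namespace: str) -> list[str]:
    """Namespace-specific rollback steps, built from standard
    kubectl rollout subcommands (history, undo, status)."""
    return [
        f"kubectl rollout history deployment/{deployment} -n {namespace}",
        f"kubectl rollout undo deployment/{deployment} -n {namespace}",
        f"kubectl rollout status deployment/{deployment} -n {namespace} --timeout=120s",
    ]
```

For a GitOps setup the equivalent section would instead document reverting the Git commit and letting ArgoCD or Flux reconcile, which is why including those config files changes the generated procedure.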
How accurate is Cowork when inferring failure modes from post-mortems?
High for documented patterns, uncertain for patterns that haven't been documented. Cowork extracts failure modes from what the post-mortems explicitly describe. If your post-mortems are vague ("service was slow, restarted it, resolved"), Cowork's failure mode section will be correspondingly thin. The output quality mirrors the input quality. This is actually a useful diagnostic: if Cowork can't produce useful failure modes from your post-mortems, your post-mortems aren't capturing enough detail — which points back to improving your post-mortem workflow first.
Should we store runbooks in Confluence, Notion, or somewhere else?
Where your on-call rotation will look first at 3am. If your team uses Confluence for everything, runbooks go in Confluence. If they use Notion, use Notion. The worst runbook location is the one nobody checks first. Cowork has connectors for both Confluence and Notion. For teams that want runbooks accessible in the terminal or IDE context (where the on-call engineer is working), there are also options for storing them in your code repository alongside the service.
Stop Depending on People Who Might Leave
We deploy Claude Cowork for platform engineering teams with runbook library generation included. Extract the tribal knowledge before your next incident.