The most dangerous person on your on-call rotation is your most experienced engineer — not because they're unsafe, but because so much operational knowledge lives exclusively in their head. When every runbook gap is filled by "just ask Sarah," your platform's operational knowledge is a single point of failure. Claude Cowork for runbook generation is how you systematically extract that knowledge before the next incident, the next offboarding, or the next midnight page that Sarah misses.
This article is part of the Claude Cowork for DevOps and platform engineers series. It covers the specific runbook generation workflow — what inputs to feed Cowork, which prompts to use, and how to validate the output before it goes into production. The companion articles cover incident post-mortems, infrastructure documentation, and DevOps automation workflows.
What Makes a Good Runbook (and Why Most Aren't)
A runbook is useful at 3am when an engineer who's been asleep for 3 hours is trying to resolve a P1 without escalating. That's the test. By that standard, most runbooks fail: they're too high-level, they assume context the on-call doesn't have, they're outdated, or they simply don't exist.
The specific failures that Cowork-generated runbooks address:
- Too high-level: Cowork reads your actual scripts and generates step-by-step commands, not "restart the service" but the exact command with the correct flags and the expected output.
- Missing failure modes: By reading post-mortem history, Cowork populates the "common failure modes" section with actual failure patterns your team has experienced, not theoretical ones.
- Outdated: Runbooks generated from current scripts and configs reflect the current state. The runbook review workflow (described below) catches drift.
- Don't exist: The generation workflow reduces runbook creation from 4–8 hours to 45–60 minutes per service, making it feasible to cover the backlog.
The Cowork Runbook Extraction Workflow (5 Steps)
Step 1: Collect the source materials
For each service, collect: deployment scripts (deploy.sh, Makefile, CI/CD pipeline config), monitoring query files or dashboard exports, existing architecture notes or README files, the last 5 post-mortems for the service (if available), and any Slack threads where someone explained how the service works. Drop all of this into a Cowork canvas dedicated to this service.
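Before running the generation prompt, it helps to confirm nothing obvious is missing. This is a minimal sketch of that pre-flight check, assuming a per-service directory layout; the category names and candidate paths are illustrative, not a Cowork feature:

```python
from pathlib import Path

# Source-material categories the generation prompt expects, per service.
# The candidate file paths are assumptions -- adapt to your repo layout.
EXPECTED = {
    "deployment scripts": ["deploy.sh", "Makefile"],
    "CI/CD config": [".github/workflows/deploy.yml", ".gitlab-ci.yml"],
    "architecture notes": ["README.md", "docs/architecture.md"],
}

def missing_materials(service_dir: str) -> list[str]:
    """Report which source-material categories have no file present."""
    root = Path(service_dir)
    missing = []
    for category, candidates in EXPECTED.items():
        if not any((root / c).exists() for c in candidates):
            missing.append(category)
    return missing
```

Anything the check reports as missing becomes a question for the service owner rather than a silent gap in the generated runbook.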
Step 2: Run the primary generation prompt
Use the structured prompt below. Cowork reads all the source materials simultaneously — this is the canvas advantage. It cross-references the deployment scripts against the post-mortem failures and produces a runbook that covers what you know happened, not just what you thought might happen.
Step 3: Identify and fill the gaps
Cowork's output includes a section titled "Gaps Found" — places where it inferred information or found source materials that contradicted each other. These are the sections that need review by the service owner. Budget 20–30 minutes with the relevant engineer to fill these in. This is faster than starting from scratch and far more structured than a free-form knowledge transfer.
Step 4: Validate with the on-call rotation
Before the runbook goes live, run the gap analysis prompt (below) to compare it against your last 5 incidents. Ask Cowork: "Does this runbook cover what actually went wrong in each of these incidents?" If the answer is no for any incident, you have a specific gap to fill.
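The coverage check in this step reduces to a simple set comparison. A hypothetical sketch, assuming each incident has been tagged with a root-cause label (in practice Cowork does this matching from the post-mortem prose, not from tags):

```python
def runbook_gaps(runbook_modes: set[str], incidents: dict[str, str]) -> list[str]:
    """Return incident IDs whose root cause is not covered by the runbook.

    incidents maps incident ID -> root-cause tag. Tags are illustrative.
    """
    return [iid for iid, cause in incidents.items() if cause not in runbook_modes]

covered = {"disk-full", "connection-pool-exhaustion"}
incidents = {"INC-2041": "disk-full", "INC-2087": "certificate-expiry"}
gaps = runbook_gaps(covered, incidents)  # INC-2087 is the uncovered incident
```

Each ID in `gaps` is a specific failure mode to add before publishing.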
Step 5: Publish and schedule review
Publish to Confluence via the connector. Set a review date — quarterly for active services. When infrastructure changes deploy, use the Cowork infrastructure documentation workflow to update the runbook automatically. The runbook is now a living document, not a one-time artefact.
Runbook Generation Prompt Templates
Prompt 1: Primary runbook generation

Generate an operational runbook for [SERVICE NAME]. Source materials provided:
- Deployment scripts [attached: deploy.sh, rollback.sh]
- CI/CD pipeline config [attached]
- Architecture notes / README [attached]
- Recent post-mortems [attached: list]
- Monitoring dashboard export [attached if available]

Generate a complete runbook with these sections:

1. SERVICE OVERVIEW
- What it does (2 paragraphs: business purpose + technical role)
- SLOs: availability target, latency target, error budget
- Key dependencies (upstream and downstream services)
- Owned by: [team/team lead]

2. OPERATIONAL PROCEDURES
For each procedure, include: pre-conditions, exact commands with flags, expected output, what to check if the command fails.
- Deployment procedure
- Rollback procedure
- Scaling procedure (if applicable)
- Configuration change procedure

3. MONITORING AND ALERTING
- Primary dashboard URL
- Key metrics to watch (name, normal range, alert threshold)
- Log locations (with example query for common searches)
- Alert definitions: what each alert means and first response

4. COMMON FAILURE MODES
Use the post-mortems to identify what has actually gone wrong, not just what could go wrong. For each failure mode:
- Symptoms (what the on-call sees)
- Likely root causes
- Diagnostic steps (with specific commands)
- Resolution steps
- Escalation trigger (when to escalate vs. keep investigating)

5. ESCALATION PATH
- First escalation: [role/person] when [condition]
- Second escalation: [role/person] when [condition]
- Vendor escalation: [if third-party dependencies]

6. DISASTER RECOVERY
- Data backup location and restoration procedure
- Worst-case recovery steps (service completely down, data potentially lost)

GAPS FOUND: At the end, list any sections where you inferred information or found inconsistencies in the source materials. Be specific.
Prompt 2: Gap analysis against incident history

I have a runbook draft for [SERVICE NAME] [attached] and the last 5 post-mortems for this service [attached]. For each post-mortem, answer:
1. Is the failure mode described in the post-mortem covered in the runbook?
2. If yes: are the diagnostic steps accurate based on what the post-mortem says actually worked?
3. If no: what section of the runbook should cover this and what should it say?

Produce a gap analysis table:
| Incident | Covered? | Accurate? | Recommended Addition |

Then produce the specific text additions for each gap — formatted ready to insert into the runbook.
Prompt 3: Infrastructure change update

The infrastructure for [SERVICE NAME] changed with this Terraform apply [attach plan output] deployed on [DATE]. Current runbook [attached]. Update the runbook for:
1. Any changed commands (flags, endpoints, resource names)
2. Any new failure modes introduced by the change
3. Updated monitoring references (if metrics or dashboards changed)
4. Rollback procedure update (reflect the new infrastructure state)

Show me the specific sections to update with the before/after text. Don't rewrite sections that don't change.
Runbook Quality Checklist
Before a Cowork-generated runbook goes into production, run this checklist:
- Every command is copied from the actual scripts, with the correct flags and the expected output, not paraphrased
- All items in the "Gaps Found" section have been resolved with the service owner
- The gap analysis prompt has been run against the last 5 incidents, with no uncovered failure modes remaining
- The escalation path names a current role or person for each condition
- A review date is scheduled (quarterly for active services) and the runbook is published where the on-call rotation looks first
Prioritising the Runbook Backlog
If your team has 40 services with no runbooks and capacity to write 4 per week, you need a prioritisation framework. Use the Cowork runbook gap analysis automation to generate the prioritised list automatically — services ranked by incident frequency multiplied by runbook coverage gap.
The manual version of this prioritisation:
- Services with P1/P2 incidents in the last 90 days and no runbook — these first, always
- Services on the on-call rotation where the primary owner is leaving or moving teams — before they go
- Services that junior engineers regularly escalate to senior engineers for — the knowledge gap is already visible
- New services deployed without documentation — before the first incident
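The ranking heuristic behind this list can be sketched in a few lines. A minimal sketch, assuming you score each service on incident count, runbook coverage, and owner attrition; the weights and service names are illustrative assumptions, not the Cowork automation's actual formula:

```python
def priority_score(incidents_90d: int, has_runbook: bool,
                   owner_leaving: bool = False) -> float:
    """Incident frequency multiplied by coverage gap, with a large
    bump for services about to lose their owner. Weights are assumptions."""
    coverage_gap = 0.2 if has_runbook else 1.0  # an existing runbook mostly closes the gap
    score = incidents_90d * coverage_gap
    if owner_leaving:
        score += 5  # knowledge walks out the door: jump the queue
    return score

# Hypothetical backlog: (incidents in last 90 days, has runbook, owner leaving)
services = {
    "payments-api": (4, False, False),
    "auth-service": (1, True, True),
    "report-worker": (0, False, False),
}
ranked = sorted(services, key=lambda s: priority_score(*services[s]), reverse=True)
```

Here the departing owner pushes auth-service to the top despite its low incident count, which matches the manual rule above: capture the knowledge before it leaves.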
The knowledge transfer session: For complex services where the runbook generation needs expert input, schedule a 30-minute session with the service owner. Record it (with permission). Feed the transcript to Cowork alongside the scripts and configs. The combination of technical artefacts and verbal explanation produces the most complete runbooks.
The platform engineering ROI case for this work is covered in detail in our companion article on Claude Cowork ROI in platform engineering. The short version: teams with complete, current runbooks resolve P1 incidents 40–60% faster than those without, and junior engineers can operate services independently rather than requiring escalation to senior staff.
Frequently Asked Questions
What if the service has no existing documentation at all?
This is the most common case. Cowork can generate a runbook from code alone — it reads deployment scripts, CI/CD configs, Kubernetes manifests, and Terraform files and derives the operational procedures from the infrastructure definition. The resulting runbook is more accurate than a human would write from memory, though it will have more gaps in the "failure modes" section (since those come from experience). Pair the code-derived draft with a 30-minute knowledge transfer session with the service owner for the most complete result.
How do we keep runbooks current after the initial generation?
Three mechanisms: First, the quarterly review — schedule it when the runbook is published and actually do it. Second, the infrastructure change update workflow — whenever a Terraform change deploys, run the runbook update prompt to identify what changed. Third, post-mortem feedback — when an incident reveals a runbook gap, use the gap analysis prompt to update the runbook the same day. Teams that do all three maintain runbooks that are actually useful. The Cowork automation for runbook gap analysis runs quarterly and flags stale runbooks automatically.
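The quarterly staleness check is the mechanically simple part of this. A minimal sketch, assuming you track a last-reviewed date per runbook; the 90-day interval mirrors the quarterly cadence above, and the function is a hypothetical helper, not the Cowork automation itself:

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # quarterly, per the review schedule above

def stale_runbooks(last_reviewed: dict[str, date], today: date) -> list[str]:
    """Flag runbooks whose last review is older than the quarterly interval."""
    return sorted(s for s, d in last_reviewed.items() if today - d > REVIEW_INTERVAL)
```

Anything the check flags gets the gap analysis prompt run against it before the next on-call handoff.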
Can Cowork generate runbooks from Kubernetes manifests and Helm charts?
Yes. Feed Cowork the Kubernetes manifests, the Helm chart values files, and any Helm hooks. Cowork understands Kubernetes resource types and will generate deployment, scaling, and rollback procedures based on the actual manifest configuration. The output includes kubectl commands specific to your namespace and deployment names. If you use ArgoCD or Flux for GitOps, include those configuration files and Cowork will document the deployment process through your actual workflow rather than generic kubectl commands.
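To illustrate what "kubectl commands specific to your namespace and deployment names" means in practice, a generated rollback procedure typically reduces to the standard kubectl rollout subcommands with your names filled in. The helper below is a hypothetical sketch of that output shape, not Cowork's actual output:

```python
def rollback_commands(deployment: str, namespace: str) -> list[str]:
    """Namespace-specific rollback steps, built from standard
    kubectl rollout subcommands (history, undo, status)."""
    return [
        f"kubectl rollout history deployment/{deployment} -n {namespace}",
        f"kubectl rollout undo deployment/{deployment} -n {namespace}",
        f"kubectl rollout status deployment/{deployment} -n {namespace} --timeout=120s",
    ]
```

For a GitOps setup the equivalent section would instead document reverting the Git commit and letting ArgoCD or Flux reconcile, which is why including those config files changes the generated procedure.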
How accurate is Cowork when inferring failure modes from post-mortems?
High for documented patterns, uncertain for patterns that haven't been documented. Cowork extracts failure modes from what the post-mortems explicitly describe. If your post-mortems are vague ("service was slow, restarted it, resolved"), Cowork's failure mode section will be correspondingly thin. The output quality mirrors the input quality. This is actually a useful diagnostic: if Cowork can't produce useful failure modes from your post-mortems, your post-mortems aren't capturing enough detail — which points back to improving your post-mortem workflow first.
Should we store runbooks in Confluence, Notion, or somewhere else?
Where your on-call rotation will look first at 3am. If your team uses Confluence for everything, runbooks go in Confluence. If they use Notion, use Notion. The worst runbook location is the one nobody checks first. Cowork has connectors for both Confluence and Notion. For teams that want runbooks accessible in the terminal or IDE context (where the on-call engineer is working), there are also options for storing them in your code repository alongside the service.
Stop Depending on People Who Might Leave
We deploy Claude Cowork for platform engineering teams with runbook library generation included. Extract the tribal knowledge before your next incident.