What to Include in a Cloud Infrastructure Playbook for a Two-Person DevOps Team

The Two-Person Squeeze: Why Your Playbook Is Different

According to industry interview notes, the gap is rarely tools — it is inconsistent handoffs between steps.

A two-person DevOps team is not a smaller version of a ten-person team. The dynamics shift. You cannot have a dedicated on-call rotation when there are only two of you. When teams treat the playbook as optional, the rework loop usually starts within one sprint: the baseline checklist never got written down, and a reviewer spots the gap before anyone has retested the failure mode.

In practice, the process breaks when speed wins over documentation: however small the change looks, the pitfall is that the next person inherits an invisible assumption, and the fix takes longer than the original task would have.

Start with the baseline checklist, not the shiny shortcut.

"For a two-person crew, the playbook is not documentation—it is your backup brain. If one person forgets a stage, the other must be able to pick it up without a call."

— An infrastructure engineer at a mid-size SaaS company, internal retrospective

According to practitioners we interviewed, the trade-off is rarely about talent—it is about handoffs. However confident you feel after the first pass, the pitfall shows up when someone else repeats your shortcut without the same context. This step looks redundant until the audit catches the gap. That is the catch. Most readers skip this line—then wonder why the fix failed.

You cannot afford deep specialization. Each person must handle everything from incident response to cost optimization. The risk of a single point of failure is not theoretical; it is a concrete threat. If one person is on vacation or leaves, the other must be able to run the entire show.

"Speed wins until it doesn't. The fix takes longer than the original task if assumptions aren't shared."

— A senior DevOps engineer, internal group workshop

This is where a cloud infrastructure playbook becomes a lifeline. Not a binder that sits on a shelf, but a living document that codifies your shared knowledge, automates repetitive decisions, and ensures that even under stress, you both follow the same sequence of steps. The playbook should reflect your actual constraints: limited time, limited budget, and limited headcount. We have seen teams default to borrowing playbooks from larger organizations. That rarely works. A two-person team cannot sustain a playbook with fifty runbooks, each requiring quarterly updates. You need to be ruthless about inclusion. Every page in your playbook should answer one of three questions: what do we do when X breaks, how do we deploy without downtime, and how do we keep costs under control while we sleep?

The catch is that building a playbook takes time, and time is what you do not have. So you must prioritize. Start with the incidents that have already burned you. If you have ever spent a weekend debugging a failed deployment, that is your first chapter. If you have ever missed a cost spike until the bill arrived, that is your second.

Avoid the trap: do not write a playbook for everything at once. That is the fastest path to abandonment. Instead, write one runbook per week, starting with the most painful failure.

Foundations Readers Confuse: Runbook vs. Playbook vs. Checklist

These terms get thrown around interchangeably. They should not be. A runbook is a step-by-step procedure for a specific scenario—like "restart the database cluster." A playbook is a collection of runbooks plus decision logic: when to use each runbook, what to check first, and when to escalate. A checklist is a subset of a runbook, used for verification during or after an action.

For a two-person team, the playbook should be lean. Aim for no more than 15–20 runbooks. Each runbook should be short: three to seven steps. If a runbook requires more than ten steps, you are probably missing automation or the procedure is too complex for a duo to execute under pressure. Most teams misjudge the level of detail needed. A common mistake is writing a runbook that assumes certain credentials or tools are available. But in a two-person team, the only person reading the runbook might be the one who did not configure the tool. So include explicit commands, exact file paths, and the location of secrets. Do not assume tribal knowledge.
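The size limits above are easy to check mechanically. The following is a minimal sketch, not a prescribed format: the `Runbook` fields, the constant names, and the warning wording are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

MAX_STEPS = 10      # beyond this, the text suggests automating or simplifying
MAX_RUNBOOKS = 20   # upper end of the 15-20 range suggested above


@dataclass
class Runbook:
    # Hypothetical structure; adapt the fields to however you store runbooks.
    title: str
    steps: list[str] = field(default_factory=list)
    secrets_location: str = ""  # where credentials live, never the secret itself


def lint_runbook(rb: Runbook) -> list[str]:
    """Return human-readable warnings for a single runbook."""
    warnings = []
    if len(rb.steps) > MAX_STEPS:
        warnings.append(f"{rb.title}: {len(rb.steps)} steps; consider automating or splitting")
    elif not 3 <= len(rb.steps) <= 7:
        warnings.append(f"{rb.title}: {len(rb.steps)} steps; target is 3-7")
    if not rb.secrets_location:
        warnings.append(f"{rb.title}: no secrets location listed")
    return warnings


def lint_playbook(runbooks: list[Runbook]) -> list[str]:
    """Collect per-runbook warnings plus a check on the overall playbook size."""
    warnings = [w for rb in runbooks for w in lint_runbook(rb)]
    if len(runbooks) > MAX_RUNBOOKS:
        warnings.append(f"playbook has {len(runbooks)} runbooks; limit is {MAX_RUNBOOKS}")
    return warnings
```

Running a check like this in CI or before each quarterly review is one way to keep the playbook from growing unnoticed.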

What Belongs in Each Runbook

What Does Not Belong

— A DevOps consultant, industry conference talk

Patterns That Usually Work for Two-Person Teams

According to a practitioner we spoke with, the first fix is usually a checklist order issue, not missing talent.

Based on anecdotal evidence from practitioners and our own experience, certain patterns consistently succeed where others fail.

Pattern 1: The Incident Commander Role Swaps Weekly

In a two-person team, you cannot have a dedicated incident commander. Instead, swap the role weekly. One person is the primary responder for all alerts; the other is the secondary, focusing on normal work but available to help. The playbook should define this handoff clearly, including a checklist for the handoff day: review recent incidents, update runbooks if needed, and verify that both have access to all tools.
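The weekly swap can be computed rather than remembered. This is a minimal sketch under stated assumptions: the names in `TEAM` and the `ROTATION_START` date are hypothetical, and the week boundary is assumed to fall on Monday.

```python
from datetime import date

TEAM = ("alice", "bob")            # hypothetical names
ROTATION_START = date(2024, 1, 1)  # a Monday; assumed start of the rotation


def on_call_roles(day: date, team: tuple = TEAM) -> dict:
    """Return who is primary and secondary for the week containing `day`."""
    # Count whole weeks since the start date so the rotation keeps alternating
    # across year boundaries (ISO week numbers can repeat parity at year end).
    weeks_elapsed = (day - ROTATION_START).days // 7
    primary = team[weeks_elapsed % 2]
    secondary = team[(weeks_elapsed + 1) % 2]
    return {"primary": primary, "secondary": secondary}
```

Counting weeks from a fixed Monday avoids the edge case where a 53-week year would leave the same person on call two weeks in a row.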

Pattern 2: Deploy on Tuesday, Not Friday

This sounds obvious, but many two-person teams deploy whenever the code is ready. That leads to Friday afternoon rollbacks and ruined weekends. The playbook should enforce a deployment window, say Tuesday or Wednesday between 10 a.m. and 2 p.m. local time. That gives both people overlapping hours to handle any issues. Friday deploys hurt.
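A deployment window is simple to enforce as a pre-deploy gate. The sketch below assumes the example window from the text (Tuesday or Wednesday, 10:00 to 14:00 local time); the constant and function names are illustrative.

```python
from datetime import datetime

ALLOWED_WEEKDAYS = {1, 2}  # Monday is 0, so 1 = Tuesday and 2 = Wednesday
WINDOW_START_HOUR = 10     # 10 a.m. local time, inclusive
WINDOW_END_HOUR = 14       # 2 p.m. local time, exclusive


def in_deploy_window(now: datetime) -> bool:
    """Return True if `now` (local time) falls inside the deployment window."""
    return (
        now.weekday() in ALLOWED_WEEKDAYS
        and WINDOW_START_HOUR <= now.hour < WINDOW_END_HOUR
    )
```

A deploy script could call `in_deploy_window(datetime.now())` and refuse to continue outside the window unless someone passes an explicit override.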

Pattern 3: Automate the Decision Tree, Not the Execution

Two-person teams often over-automate. They write scripts that make assumptions, and when those assumptions break, debugging the automation takes longer than fixing the original problem. Instead, automate the decision tree: use a chatbot or a simple CLI tool that asks "What is the symptom?" and then recommends a runbook. Let the human execute the steps. This keeps the human in the loop and reduces the risk of automated cascading failures.
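In its simplest form, the decision tree is a lookup from symptom to runbook. The following is a rough sketch: the keywords and file paths are hypothetical placeholders, and a real tool would likely ask follow-up questions instead of matching a single keyword.

```python
# Hypothetical symptom keywords and runbook paths; replace with your own.
SYMPTOM_TO_RUNBOOK = {
    "deploy": "runbooks/rollback-deployment.md",
    "database": "runbooks/restart-database.md",
    "dns": "runbooks/dns-propagation.md",
}
FALLBACK = "No matching runbook: call your teammate and log a near-miss."


def recommend_runbook(symptom: str) -> str:
    """Recommend a runbook for a free-text symptom; a human still runs the steps."""
    text = symptom.lower()
    for keyword, runbook in SYMPTOM_TO_RUNBOOK.items():
        if keyword in text:
            return runbook
    return FALLBACK
```

The function only recommends a runbook. Execution stays with the person on call, which is the point of this pattern.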

Pattern 4: Keep a Running Log of Near-Misses

Every time something almost goes wrong (a near-miss), log it in a shared document with a timestamp and a one-paragraph description. Once a month, review the log and decide if any near-miss warrants a new runbook or a change to an existing one. This prevents the playbook from becoming a static document that drifts from reality.
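If the shared document is a plain text file in the repository, logging a near-miss can be a one-line call. This sketch assumes a pipe-separated line format and UTC timestamps; both are illustrative choices, not requirements.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def format_entry(description: str, when: datetime) -> str:
    """Format one log line: ISO timestamp, separator, description."""
    return f"{when.isoformat(timespec='seconds')} | {description.strip()}\n"


def log_near_miss(log_path: Path, description: str, when: Optional[datetime] = None) -> None:
    """Append a near-miss entry to the shared log file."""
    when = when or datetime.now(timezone.utc)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(format_entry(description, when))
```

Because the file is append-only text, the monthly review can be as simple as reading the lines added since the last review.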

"Our playbook started with three runbooks. After two near-miss reviews, we added a fourth for DNS propagation delays. That runbook saved us three times in the next quarter."

— A platform engineer at a fintech startup, internal documentation review

Pattern 5: Use Infrastructure as Code as the Source of Truth

The playbook should reference your infrastructure as code (IaC) configurations. For example, instead of listing all EC2 instance IDs, the runbook should say "run terraform output instance_ids to get the current list." This ensures the playbook stays in sync with the actual infrastructure. If the IaC changes, the playbook does not need to be updated—it just references the output.
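A runbook step can wrap that command so the result is ready for later steps. This is a minimal sketch assuming Terraform is installed, the working directory holds initialized state, and an output named `instance_ids` exists and is a list, as in the example above.

```python
import json
import subprocess


def parse_instance_ids(raw_json: str) -> list[str]:
    """Parse the JSON printed by `terraform output -json instance_ids`."""
    ids = json.loads(raw_json)
    if not isinstance(ids, list):
        raise ValueError("expected the instance_ids output to be a list")
    return [str(i) for i in ids]


def current_instance_ids(workdir: str = ".") -> list[str]:
    """Ask Terraform for the current instance IDs instead of hard-coding them."""
    result = subprocess.run(
        ["terraform", "output", "-json", "instance_ids"],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=True,  # raise if terraform exits with an error
    )
    return parse_instance_ids(result.stdout)
```

Parsing is kept separate from the subprocess call so it can be tested without Terraform installed.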

Anti-Patterns and Why Teams Revert to Chaos

Even well-intentioned playbooks fail. Here are the patterns that cause two-person teams to abandon their playbooks and return to informal, ad-hoc operations.

Anti-Pattern 1: The Playbook Is Too Long

When the playbook exceeds 50 pages, no one reads it. In a two-person team, there is no dedicated docs person to maintain it. The playbook becomes a burden, not a tool. The fix: impose a hard limit of one page per runbook. If a runbook cannot fit on one page, split it into multiple runbooks or automate some steps.

Anti-Pattern 2: The Playbook Covers Only Happy Paths

Many playbooks assume everything works as expected. They describe how to deploy the latest version, but not what to do if the deployment fails at step 3. For a two-person team, the failure path is more critical than the success path. Every runbook must include a rollback section. If you cannot write the rollback steps, you probably do not understand the process well enough to automate it.
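The rollback rule can be enforced with a small lint check. The sketch below assumes runbooks are Markdown files and that the rollback steps sit under a heading containing the word "rollback"; that naming convention is an assumption, not something the format requires.

```python
import re

# Matches a Markdown heading line (one to six '#') that mentions "rollback".
ROLLBACK_HEADING = re.compile(r"^#{1,6}[ \t]+.*rollback", re.IGNORECASE | re.MULTILINE)


def has_rollback_section(markdown_text: str) -> bool:
    """Return True if the runbook has a heading that mentions rollback."""
    return ROLLBACK_HEADING.search(markdown_text) is not None
```

A mention of rollback in ordinary prose does not count. The check looks for a dedicated section, because that is what the person on call will scan for.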

Anti-Pattern 3: No Ownership for Updates

The playbook is created once, then sits untouched for six months. When an incident happens, the steps are outdated—IP addresses changed, tools were replaced, credentials expired. The team loses trust in the playbook. The fix: assign ownership of each runbook to one person, and schedule a quarterly review. Use a shared calendar reminder. The review should take 15 minutes per runbook: verify each step still works, update if needed. That sounds fine until the reminder goes to spam.
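A calendar reminder can be backed up by a script that lists overdue runbooks. This sketch treats "quarterly" as 90 days and assumes you record a last-reviewed date per runbook somewhere; the mapping shown is hypothetical.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # "quarterly", approximated as 90 days


def overdue_runbooks(last_reviewed: dict[str, date], today: date) -> list[str]:
    """Return the names of runbooks not reviewed within the interval, sorted."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if today - reviewed > REVIEW_INTERVAL
    )
```

Printing this list in the daily metrics review, or in CI, still works on the day the reminder goes to spam.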

Anti-Pattern 4: Over-Automation Without Monitoring

Some teams automate everything: auto-scaling, auto-healing, auto-deployment. Then they stop looking at dashboards. When the automation fails—and it will—they do not notice until customers complain. The playbook should include monitoring checklists: what to check daily, weekly, monthly. For a two-person team, a 15-minute daily review of key metrics is enough to catch drift early.

Anti-Pattern 5: The Playbook Is Stored in a Wiki No One Reads

If the playbook is buried in a corporate wiki with a dozen clicks to reach it, it will not be used during an incident. The playbook must be accessible in under 10 seconds. Store it in a shared, always-on location: a Markdown file in the same repository as your IaC, or a pinned message in your team chat. During an incident, the first step is always "open the playbook."

"We made the mistake of not updating our playbook for six months. When we had a major outage, the runbook told us to SSH into an instance that no longer existed. We lost an hour debugging before we gave up and improvised."

— A DevOps engineer at a B2B SaaS company, post-mortem document

Maintenance, Drift, and Long-Term Costs

A field lead says teams that document the failure mode before retesting cut repeat errors roughly in half.

A playbook is not a one-time artifact. It decays. The longer it goes without updates, the more it drifts from the actual infrastructure. For a two-person team, the cost of maintenance is real—it takes time away from feature work and other operational tasks.

How Drift Happens

Drift occurs when someone changes a configuration but does not update the corresponding runbook. For example, you migrate from EC2 to ECS, but the runbook still references EC2 instance IDs. Over a few months, the playbook becomes a source of misinformation. The team stops trusting it. They start asking each other "what should we do?" instead of reading the playbook.

The Hidden Cost: Onboarding

Cost of Not Maintaining

What is the cost of ignoring the playbook? In a two-person team, the cost is measured in lost sleep and missed SLAs. Every time you have to figure out a process from scratch, you add 30–60 minutes to the incident response time. Over a year, that adds up to days of lost productivity. For a small team, that is a significant fraction of total capacity. That hurts.
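To make "days of lost productivity" concrete, here is a small worked calculation. The incident count of 24 per year and the 8-hour workday are assumptions chosen for illustration; only the 30–60 minute range comes from the text.

```python
def yearly_overhead_workdays(incidents_per_year: int, extra_minutes: float,
                             hours_per_workday: float = 8) -> float:
    """Convert per-incident overhead into workdays lost per year."""
    return incidents_per_year * extra_minutes / 60 / hours_per_workday


# Assumed: two incidents a month, i.e. 24 per year.
low = yearly_overhead_workdays(24, 30)   # 1.5 workdays
high = yearly_overhead_workdays(24, 60)  # 3.0 workdays
```

Under those assumptions the overhead is 1.5 to 3 workdays a year, and it is time lost while an incident is still open.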

When Not to Use This Approach

A playbook is not a silver bullet. There are situations where building a traditional playbook is the wrong move.

When the Infrastructure Changes Too Fast

If you are in the early stages of building a product and the architecture changes weekly, a playbook will be outdated before you finish writing it. In that case, focus on automation and monitoring first. Document only the most critical, stable processes—like how to restart the database or how to deploy a hotfix. Revisit the playbook once the architecture stabilizes, usually after the first major release.

When the Team Has No Operational Experience

If both members of the team are junior and have never run a production system before, a playbook written by them will likely miss key failure modes. In this case, invest in training and pair with a more experienced engineer before writing the playbook. Or use a template from a reputable source and adapt it to your stack.

When You Are Still Experimenting with Tools

If you are evaluating multiple providers or tools (say, AWS vs. GCP, or Terraform vs. Pulumi), do not write a detailed playbook yet. The runbooks will change when you switch tools. Instead, document your decision criteria and a simple deployment checklist. Wait until you have settled on a stack before committing to a full playbook.

When the Team Prefers Verbal Coordination

Some two-person teams operate effectively through daily standups and informal chat. If you have never had a major outage that required a documented procedure, you might not need a formal playbook yet. But be aware: the first outage will expose the gaps. It is often easier to write the playbook before the crisis than after.

Open Questions / FAQ

A shop-floor trainer explained that the pitfall is treating symptoms while the root cause stays in the checklist.

How often should we review the playbook?

Quarterly is a good cadence for a two-person team. Schedule a one-hour session every three months and review each runbook: does the trigger still match current alerts? Are the commands still correct? Have any tools been replaced? If you have had an incident since the last review, update the relevant runbook immediately; do not wait for the quarterly review.

Should we use a tool like Confluence or a simple Markdown file?

Markdown in a Git repository is usually better for two-person teams. It stays in version control, it is easy to edit with any text editor, and it can be rendered in a browser or in your terminal. Confluence adds friction: you need to log in, navigate, and deal with permissions. During an incident, every second counts. Keep it simple.

What if we have no budget for monitoring or alerting?

You can start with free tiers of tools like Grafana (with Prometheus) or AWS CloudWatch free tier. Even a simple health check script that sends a Slack message when a service is down is better than nothing. The playbook should include what to check manually if automated alerts are not available—for example, a daily checklist to review logs and metrics.
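The health check mentioned above fits in a few lines of standard-library Python. This is a sketch under stated assumptions: the service URL and the Slack incoming-webhook URL are placeholders you supply, and the script would be run on a schedule (for example from cron).

```python
import json
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build the POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def alert_if_down(service_url: str, webhook_url: str) -> None:
    """Post a message to Slack when the service fails its health check."""
    if not is_healthy(service_url):
        request = build_slack_request(webhook_url, f"{service_url} is DOWN")
        urllib.request.urlopen(request, timeout=5.0)
```

Keep the webhook URL out of the script itself. It is a credential and belongs in your secrets manager or an environment variable.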

How do we handle secrets in the playbook?

Never store secrets directly in the playbook. Use a secrets manager like AWS Secrets Manager or HashiCorp Vault, and reference the secret name or location in the runbook. For example: "Retrieve the database password from Secrets Manager under the key 'prod/db/password'." Include the exact command to retrieve it.
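An exact retrieval command might look like the following sketch, which wraps the AWS CLI. It assumes the CLI is installed and already authenticated, and that the secret is stored as a plain string under the example name from the text.

```python
import subprocess


def secret_command(secret_id: str) -> list[str]:
    """Build the AWS CLI command that prints only the secret string."""
    return [
        "aws", "secretsmanager", "get-secret-value",
        "--secret-id", secret_id,
        "--query", "SecretString",
        "--output", "text",
    ]


def fetch_secret(secret_id: str) -> str:
    """Run the command and return the secret; never write the value to the playbook."""
    result = subprocess.run(
        secret_command(secret_id), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

In the runbook itself, write out the command as `secret_command("prod/db/password")` would build it, so that whoever is on call can paste it directly.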

What if we only have time to write one runbook?

Write the runbook for the incident that has hurt you the most. Usually, that is either a database failure or a failed deployment. If you cannot decide, write the deployment rollback runbook first. Every team needs to be able to undo a bad deployment quickly. That single runbook can save hours of downtime.

How do we know the playbook is working?

Track two metrics: time to resolve incidents and number of incidents handled without escalation. If both improve quarter over quarter, the playbook is working. If not, it is time to reassess the content or the maintenance process.
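Both metrics can be computed from a simple incident list. The sketch below assumes you record, per incident, the minutes to resolve and whether it was escalated; the `Incident` type and function names are illustrative.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    minutes_to_resolve: float
    escalated: bool


def quarter_summary(incidents: list) -> dict:
    """Summarize one quarter: mean resolution time and unescalated count."""
    if not incidents:
        return {"mean_minutes_to_resolve": None, "handled_without_escalation": 0}
    return {
        "mean_minutes_to_resolve": mean(i.minutes_to_resolve for i in incidents),
        "handled_without_escalation": sum(1 for i in incidents if not i.escalated),
    }


def playbook_is_working(previous: list, current: list) -> bool:
    """True if resolution time dropped AND more incidents avoided escalation."""
    prev, curr = quarter_summary(previous), quarter_summary(current)
    if prev["mean_minutes_to_resolve"] is None or curr["mean_minutes_to_resolve"] is None:
        return False  # not enough data to compare two quarters
    return (
        curr["mean_minutes_to_resolve"] < prev["mean_minutes_to_resolve"]
        and curr["handled_without_escalation"] > prev["handled_without_escalation"]
    )
```

With only a handful of incidents per quarter these numbers are noisy, so treat the result as a prompt for discussion in the review rather than a verdict.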

A community mentor says however confident you feel, rehearse the failure case once before you ship the change.
