Designing On-Call Rotations People Don't Quit
Two engineers quit in the same month, and both times on-call was the real reason. Neither said so in their official exit interview; on the record it was "better opportunity" and "career growth." But in off-the-record conversations the pattern was clear: they hadn't slept properly in months, their weekends were regularly interrupted, and they felt the interruptions were preventing them from doing good work during the week. They were right on all three counts.
The on-call experience at most engineering organizations is poorly designed, poorly compensated, and poorly scoped. It's treated as a fact of engineering life rather than an operational system that can be engineered. This post is about treating it like a system.
The Two Failure Modes of On-Call
Most organizations with on-call problems have one of two failure modes, and they require different fixes.
Failure mode 1: Alert volume is too high. Engineers are paged frequently, including for non-actionable conditions. Every page requires at least some cognitive engagement: even if you determine within 60 seconds that the alert is noise, you've woken up, checked your phone, and now need to fall back asleep. Above roughly 3-4 pages per night, sleep quality degrades to the point where daytime engineering performance is meaningfully impaired.
Failure mode 2: Rotation is too small. The on-call burden is concentrated on a few engineers, usually the senior ones who know the system best. Six engineers on a rotation, each on-call one week in six, means each engineer is on-call nearly nine weeks per year (52/6 ≈ 8.7), including holidays, family events, and their own health situations. If some engineers opt out (for legitimate reasons: medical, family, preference), the remaining engineers carry more.
These failure modes often coexist. The right fix for high alert volume is engineering investment in alert quality. The right fix for small rotation is hiring or scope reduction. Treating one as the other produces the wrong interventions.
Alert Budget: The Concept and the Math
An alert budget is a hard weekly limit on the number of actionable pages an on-call engineer should receive. Setting this number forces conversations that need to happen but usually don't.
Our current alert budget: 8 actionable pages per week per on-call engineer. Above that threshold, the rotation is defined as overloaded and we trigger an automatic review of alert configuration for the responsible service.
The math behind the number: an engineer on-call for a week has 168 hours. If we target 7-8 hours of sleep per night and estimate each page costs 30 minutes of sleep disruption (including the return to sleep), 8 pages cost 4 hours of sleep across the week, roughly 7-8% of a 49-56 hour weekly sleep budget. Beyond that, we're into the range where next-day performance degradation becomes measurable (and has been measured in the research literature on sleep and cognitive performance).
This is not a soft preference — we track it as an engineering metric and it triggers engineering work.
```python
# Weekly alert volume tracker — runs in our monitoring platform
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

ALERT_BUDGET_PER_WEEK = 8

@dataclass
class Page:
    timestamp: datetime
    service: str
    alert_name: str
    actionable: bool  # False = noise, True = required engineer action
    resolution_time_minutes: int

def compute_weekly_alert_load(pages: List[Page], week_start: datetime) -> dict:
    week_end = week_start + timedelta(days=7)
    week_pages = [p for p in pages if week_start <= p.timestamp < week_end]
    actionable = [p for p in week_pages if p.actionable]
    # Noise rate is computed over this week's pages, not all history
    noise_rate = (
        sum(1 for p in week_pages if not p.actionable) / len(week_pages)
        if week_pages else 0.0
    )
    return {
        "actionable_pages": len(actionable),
        "over_budget": len(actionable) > ALERT_BUDGET_PER_WEEK,
        "noise_rate": noise_rate,
        "total_resolution_minutes": sum(p.resolution_time_minutes for p in actionable),
        "services": sorted({p.service for p in actionable}),
    }
```

Alert Classification: The Triage You Do Once
Before you can reduce alert volume, you need to classify what you have. We run a quarterly alert audit with a simple three-category framework:
Actionable: The page requires a human decision or action. If the engineer doesn't respond, something bad happens. This is the only category that should wake someone up.
Informational: Something happened that an engineer should know about, but not urgently. These go to a Slack channel, not to PagerDuty.
Noise: The alert fires but doesn't correspond to a real problem (false positive) or corresponds to a problem the system self-heals from. These should be deleted, not muted.
The audit is mechanical: we pull the last 90 days of alert history and, for each alert that fired, ask "Was there a human action taken in response?" If no action was taken in more than 80% of its firings, the alert is noise. If action was taken but could have waited until morning, it's informational.
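The rule is mechanical enough to script. A sketch of the audit pass; the AuditRecord type, its fields, and the classify_alerts helper are our own illustrative convention, not a vendor API, and only the 80% threshold comes from the process above:

```python
# Quarterly alert audit — classify every alert from 90 days of history.
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class AuditRecord:
    alert_name: str
    action_taken: bool   # did a human do anything in response?
    could_wait: bool     # if acted on, could it have waited until morning?

def classify_alerts(history: List[AuditRecord]) -> Dict[str, str]:
    by_alert: Dict[str, List[AuditRecord]] = defaultdict(list)
    for rec in history:
        by_alert[rec.alert_name].append(rec)

    classification = {}
    for name, recs in by_alert.items():
        no_action_rate = sum(1 for r in recs if not r.action_taken) / len(recs)
        if no_action_rate > 0.8:
            classification[name] = "noise"          # delete, don't mute
        elif all(r.could_wait for r in recs if r.action_taken):
            classification[name] = "informational"  # route to Slack
        else:
            classification[name] = "actionable"     # allowed to page
    return classification
```

Run against a quarter of history, the output is a per-alert verdict that maps directly onto the delete / reroute / keep decision.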
Most organizations find that 30-50% of their alerts are noise on first audit. After the first cleanup, the noise rate drops significantly, but without quarterly audits it climbs back.
```yaml
# PagerDuty / OpsGenie escalation policy — per service
escalation_policy:
  name: "Payments Service On-Call"
  rules:
    - targets:
        - type: "schedule"
          id: "payments-primary-rotation"
      delay_in_minutes: 0     # Immediate for P1
    - targets:
        - type: "schedule"
          id: "payments-secondary-rotation"
      delay_in_minutes: 10    # Escalate if no ack in 10 min
    - targets:
        - type: "user"
          id: "engineering-manager-payments"
      delay_in_minutes: 20    # Manager escalation if still no ack

# Separate low-urgency policy — no wakeup
low_urgency_policy:
  name: "Payments Service Informational"
  rules:
    - targets:
        - type: "slack_channel"
          id: "payments-alerts-informational"
      delay_in_minutes: 0
  urgency: "low"
```

Rotation Design: Size, Scope, and Graduated Ownership
A well-designed rotation has three properties:
- Large enough that each engineer is on-call at most 1 week in 6 (preferably 1 in 8 or better)
- Scoped to a system the on-call engineer understands well enough to diagnose without expert assistance for 80% of incidents
- Structured to develop junior engineers, not just protect them from on-call
That third property is frequently skipped, and it's what keeps rotations small. If junior engineers are never on-call, they never build the operational fluency that would make them effective on-call engineers — so the rotation stays senior-heavy, stays small, and the seniors stay overloaded.
Our graduated ownership model moves each engineer through four phases: shadow → supported → independent → lead.
The shadow phase is not passive. Junior engineers are expected to read every runbook for systems they'll be on-call for, to follow every incident in their team's Slack channel, and to participate in postmortems. We track participation, not just attendance.
The supported phase uses a "two-person primary" model: junior and senior are both listed as primary on-call, with an explicit agreement that the junior handles everything P3 and below and escalates P1/P2 to the senior. This gives the junior real experience without the full blast radius exposure.
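The routing rule behind the two-person primary model is small enough to state as code. A minimal sketch; real paging tools express this as routing rules rather than a function, and the severity labels and names here are illustrative:

```python
# Two-person primary routing: junior owns P3 and below end to end,
# P1/P2 go straight to the senior. Names are illustrative.
def route_page(severity: str, junior: str, senior: str) -> str:
    """Return who gets paged first for a given severity."""
    if severity in ("P1", "P2"):
        return senior   # high blast radius goes to the senior immediately
    return junior       # everything else is the junior's to handle

assert route_page("P1", "jr@example.com", "sr@example.com") == "sr@example.com"
assert route_page("P3", "jr@example.com", "sr@example.com") == "jr@example.com"
```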
Compensation Models That Don't Create Resentment
Engineers have strong opinions about on-call compensation, and those opinions vary widely by career stage, personal situation, and cultural context. There's no universal right answer, but there are clear wrong answers.
Wrong answer 1: No compensation, on the theory that on-call is simply part of the job description. This works when the on-call burden is genuinely low. When alert volume is high and on-call nights disrupt sleep and weekends, "it's part of the job" becomes a retention risk for every engineer who has better options.
Wrong answer 2: Time-off-in-lieu (TOIL) only, without quantity limits. TOIL that accumulates faster than it can be used becomes a liability on the books and a frustration for the engineer who can't actually take it.
What works in practice is a combination:
- Fixed weekly on-call stipend (we use $150/week for primary, $75/week for secondary) — regardless of pages received
- Additional per-page compensation for pages received between 10pm and 7am local time ($25/page, paid out monthly)
- TOIL for any shift where actionable pages exceeded the weekly budget, capped at 1 day per week
The per-page late-night compensation is the component that creates the strongest incentive alignment: engineers on the receiving end of noisy alerts have a direct stake in the alert reduction work, because fewer noisy late-night pages means less disruption, even if the base compensation is unchanged.
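The model is simple enough to compute mechanically. A sketch using the figures above; Page is the record from the weekly tracker, and is_night and weekly_compensation are our own helpers:

```python
# Weekly on-call compensation under the model above.
from datetime import time

NIGHT_START, NIGHT_END = time(22, 0), time(7, 0)   # 10pm-7am local
PRIMARY_STIPEND, SECONDARY_STIPEND = 150, 75       # USD per week
NIGHT_PAGE_RATE = 25                               # USD per late-night page
ALERT_BUDGET_PER_WEEK = 8

def is_night(t: time) -> bool:
    """True if a page landed in the 10pm-7am window."""
    return t >= NIGHT_START or t < NIGHT_END

def weekly_compensation(pages, primary: bool = True) -> dict:
    night_pages = sum(1 for p in pages if is_night(p.timestamp.time()))
    actionable = sum(1 for p in pages if p.actionable)
    return {
        "stipend_usd": PRIMARY_STIPEND if primary else SECONDARY_STIPEND,
        "night_page_usd": night_pages * NIGHT_PAGE_RATE,
        # TOIL accrues only when the budget was blown, capped at 1 day/week
        "toil_days": 1 if actionable > ALERT_BUDGET_PER_WEEK else 0,
    }
```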
Escalation Design and the SRE Handoff
Most teams conflate "escalation" with "failure." Escalating an incident to a more senior engineer is treated as an admission that the on-call engineer couldn't handle it. This framing discourages escalation and leads to incidents lasting longer than they should.
We explicitly design escalation as a normal, expected path. The runbook for every critical service has a section: "When to escalate and to whom." The answer is specific: "If you've been working on this for 15 minutes without a recovery path, escalate to the payments lead engineer. If no response in 5 minutes, escalate to engineering manager." Names and contact information, not job titles.
For organizations with a dedicated SRE or platform team, the handoff protocol matters. When an incident crosses from "service team can handle it" to "requires infrastructure-level intervention," the handoff needs to be fast and complete:
- Current state of the system (what's broken, what's been tried)
- Hypothesis being worked on
- What you need from the platform team specifically
- Whether you're staying on or handing off completely
The worst handoffs are incomplete: "here's the ticket, I've been looking at it" with no context. The person receiving the handoff spends 20 minutes reconstructing what the first responder already knew.
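One way to make completeness structural rather than aspirational is to template the handoff so every field is required. A sketch; the Handoff type and its field names are our own convention:

```python
# Structured incident handoff — the "here's the ticket" handoff
# becomes impossible by construction, because every field is required.
from dataclasses import dataclass

@dataclass
class Handoff:
    current_state: str        # what's broken, what's been tried
    working_hypothesis: str   # the theory currently being tested
    ask: str                  # what you need from the platform team
    staying_on: bool          # True = pairing, False = full handoff

    def as_message(self) -> str:
        mode = "staying on to pair" if self.staying_on else "handing off completely"
        return (
            f"STATE: {self.current_state}\n"
            f"HYPOTHESIS: {self.working_hypothesis}\n"
            f"ASK: {self.ask}\n"
            f"MODE: {mode}"
        )
```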
Measuring On-Call Health
We report five metrics monthly to the engineering leadership team:
| Metric | Target | Danger Zone |
|---|---|---|
| Avg actionable pages / engineer / week | < 5 | > 8 |
| Alert noise rate | < 20% | > 40% |
| Mean time to acknowledge (MTTA) | < 5 min | > 15 min |
| Mean time to resolve (MTTR) | < 45 min | > 2 hours |
| On-call engineer satisfaction (quarterly survey, 1-10) | > 7 | < 5 |
The satisfaction survey is the one that moves engineering managers to act. Page volume and MTTR are abstract; "my team rates on-call a 4 out of 10" is not.
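Four of the five metrics roll up from the same page records as the weekly tracker. A sketch, assuming we extend Page with an acked_after_minutes field for MTTA; satisfaction comes from the survey, not from page data:

```python
# Monthly on-call health rollup from page history.
# Assumes Page (from the tracker) gains an acked_after_minutes field.
from statistics import mean

WEEKS_PER_MONTH = 52 / 12  # ≈ 4.33

def monthly_health(pages) -> dict:
    actionable = [p for p in pages if p.actionable]
    return {
        "actionable_pages_per_week": len(actionable) / WEEKS_PER_MONTH,
        "noise_rate": (1 - len(actionable) / len(pages)) if pages else 0.0,
        "mtta_minutes": mean(p.acked_after_minutes for p in actionable)
                        if actionable else 0.0,
        "mttr_minutes": mean(p.resolution_time_minutes for p in actionable)
                        if actionable else 0.0,
    }
```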
Key Takeaways
- On-call burnout has two root causes that require different fixes: high alert volume (an engineering problem) and small rotation size (a staffing or scope problem) — diagnosing which you have matters before you intervene.
- An explicit alert budget (we use 8 actionable pages per week) creates the pressure that drives engineering investment in alert quality; without the constraint, noisy alerts accumulate indefinitely.
- Quarterly alert audits using the actionable / informational / noise classification consistently find 30-50% of alerts are noise and should be deleted, not muted.
- Graduated on-call ownership (shadow → supported → independent → lead) develops junior engineers' operational fluency without exposing them to full blast radius — this is how you grow rotation size organically.
- On-call compensation should include both a base weekly stipend and per-page late-night compensation; the per-page component creates direct incentive alignment between on-call engineers and alert reduction work.
- Escalation should be explicitly designed as a normal, expected path with specific names and timeframes — framing escalation as failure discourages it and makes incidents last longer than necessary.