On-call is where a lot of engineering teams quietly fall apart. You implement it with good intentions—“we need someone available if the site goes down”—and six months later your best engineers are interviewing elsewhere because they got woken up at 2 AM three times in one week for issues that could have waited until morning.
I’ve helped multiple companies redesign their on-call systems after they reached a crisis point: senior engineers refusing to participate, alerts being ignored, or worse, people leaving the company entirely. The problem is almost never that engineers don’t want responsibility—it’s that the on-call system sets them up to fail.
Here’s how to structure on-call rotations that actually work.
The Core Principle: On-Call Should Be Boring
If your on-call rotation is exciting, something is wrong. Good on-call shifts are uneventful—alerts are rare, when they happen they’re actionable, and engineers have the tools and runbooks to resolve issues quickly.
Bad on-call rotations are characterized by:
- Alert fatigue: 20+ pages per shift, mostly false alarms
- Mystery alerts: “Service X is down” with no context or runbook
- Unfixable issues: Alerts for problems that require architectural changes
- Unclear escalation: Engineers don’t know when to wake up their manager
- No time to fix root causes: Teams spend all their time firefighting
If your on-call engineers are constantly stressed, the problem isn’t the people—it’s the system.
Rotation Structure That Actually Works
Single-tier on-call doesn’t scale. One person being responsible for everything leads to burnout. You need tiers:
Tier 1 (Primary on-call): Receives all alerts, handles known issues, escalates unknowns
Tier 2 (Escalation): Senior engineers who can debug complex problems
Tier 3 (Management/leadership): Gets involved for major outages or business-critical decisions
Most teams stop at a single tier and wonder why engineers burn out. The primary on-call engineer should be able to resolve 80% of issues without waking anyone else; the remaining 20% should follow a clear escalation path.
# Example on-call schedule structure
Primary Rotation (Tier 1):
- Duration: 1 week
- Team: All engineers (rotates)
- Responsibility: First responder for all alerts
- SLA: Acknowledge within 15 minutes
Escalation (Tier 2):
- Duration: 1 week (offset from primary)
- Team: Senior engineers
- Responsibility: Handle escalations, assist with complex issues
- SLA: Respond within 30 minutes of escalation
On-call Manager (Tier 3):
- Duration: 1 month
- Team: Engineering managers
- Responsibility: Major incidents, customer escalations
- SLA: Respond to Sev-1 incidents
This gives the primary on-call engineer a clear escalation path, reduces pressure, and ensures they’re not alone.
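The escalation logic itself is small enough to encode directly in tooling. Here's a minimal sketch in Python, purely illustrative: the field names are hypothetical, and the 15- and 30-minute thresholds mirror the acknowledgement SLA above and the escalation criteria later in this post.
from dataclasses import dataclass
@dataclass
class PageState:
    severity: int        # 1 = Sev-1 (major outage), 2+ = lower severity
    age_minutes: float   # minutes since the alert fired
    acknowledged: bool   # primary has acknowledged the page
    resolved: bool
def responders(page: PageState) -> list[str]:
    """Who should be engaged right now, per the SLAs in the schedule above."""
    if page.resolved:
        return []
    engaged = ["tier1-primary"]
    # Missed the 15-minute acknowledgement SLA, or still unresolved after 30 minutes.
    if (not page.acknowledged and page.age_minutes > 15) or page.age_minutes > 30:
        engaged.append("tier2-escalation")
    if page.severity == 1:
        engaged.append("tier3-oncall-manager")  # Sev-1 incidents pull in the on-call manager
    return engaged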
Rotation Length: Not Too Short, Not Too Long
One-day rotations are chaos. Engineers spend the whole day anxious, then hand off without building context. Issues that span multiple days require constant handoffs.
Month-long rotations lead to burnout. Four weeks of interrupted sleep and weekend alerts will break anyone.
One-week rotations are the sweet spot. Engineers can build context, most issues resolve within the week, and people get 3-4 weeks of uninterrupted time between shifts.
Some teams do two-week rotations, which also works. The key is balancing context-building time with recovery time.
Follow-the-sun rotations work well for distributed teams:
# Example follow-the-sun schedule
EMEA shift (9 AM - 5 PM UTC):
Primary: Engineer in London or Berlin
US shift (9 AM - 5 PM PST / 5 PM - 1 AM UTC):
Primary: Engineer in SF or NYC
APAC shift (9 AM - 5 PM SGT / 1 AM - 9 AM UTC):
Primary: Engineer in Singapore or Sydney
This eliminates most after-hours pages. The catch: you need engineers in all time zones and tight handoff processes.
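The handoff logic itself is just a lookup by UTC hour. A minimal sketch in Python, assuming shifts are keyed purely off UTC (which glosses over daylight-saving changes; the real schedule belongs in your paging tool):
from datetime import datetime, timezone
from typing import Optional
# UTC hour ranges lifted from the example schedule above.
SHIFTS = [
    (9, 17, "EMEA"),   # 9 AM - 5 PM UTC
    (17, 24, "US"),    # 5 PM - midnight UTC
    (0, 1, "US"),      # midnight - 1 AM UTC (the US shift wraps past midnight)
    (1, 9, "APAC"),    # 1 AM - 9 AM UTC
]
def primary_region(now: Optional[datetime] = None) -> str:
    """Return which regional rotation holds primary on-call right now."""
    hour = (now or datetime.now(timezone.utc)).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise RuntimeError("no shift covers this hour")  # unreachable with full 24h coverage
print(primary_region())  # e.g. "EMEA" at 10:00 UTC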
Setting Clear Expectations
On-call stress often comes from unclear expectations. Engineers don’t know:
- How fast they need to respond
- What issues justify waking up the team
- When to escalate vs. keep trying to fix it themselves
- Whether they’re expected to work weekends
- If they’ll get comp time for after-hours pages
Document this explicitly:
# On-Call Expectations
## Response Times
- Critical alerts (production down): 15 minutes
- High-priority alerts (degraded service): 30 minutes
- Medium-priority alerts (non-customer-facing): 1 hour
- Low-priority alerts: Next business day
## Escalation Criteria
Escalate to Tier 2 if:
- Issue isn't resolved within 30 minutes
- Issue requires database changes in production
- Multiple services are affected
- Customer data may be at risk
Escalate to management if:
- Outage affects >50% of users
- Potential data breach or security incident
- Resolution will take >2 hours
## Compensation
- Pages between 10 PM and 8 AM: +4 hours comp time
- Pages on weekends: +8 hours comp time
- Major incidents (>2 hours): Additional comp time at manager discretion
Comp time must be used within 30 days.
This removes ambiguity. Engineers know exactly what’s expected and when they can go back to sleep without guilt.
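The response-time matrix is also worth encoding as data, so paging rules and reports read the same policy the humans do. A minimal sketch in Python: the severity names and thresholds mirror the doc above, the routing strings are hypothetical, and whether each severity pages at all is my reading of the response times.
# Response-time matrix from the expectations doc, encoded as data.
RESPONSE_POLICY = {
    "critical": {"respond_within_minutes": 15, "pages": True},    # production down
    "high":     {"respond_within_minutes": 30, "pages": True},    # degraded service
    "medium":   {"respond_within_minutes": 60, "pages": True},    # non-customer-facing
    "low":      {"respond_within_minutes": None, "pages": False}, # next business day
}
def route(severity: str) -> str:
    """Page for anything with a paging SLA; file a ticket for everything else."""
    return "page-tier1-primary" if RESPONSE_POLICY[severity]["pages"] else "create-ticket"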
Runbooks: The Difference Between Fixable and Awful
The worst on-call pages are the ones with no runbook:
Alert: API latency high
Details: p95 latency >2s
Runbook: (none)
The engineer has to:
- Figure out what service is involved
- Find the logs
- Guess at what’s causing high latency
- Try random fixes
- Hope something works
This turns a 10-minute fix into a 90-minute investigation at 3 AM.
A good alert includes a runbook:
Alert: API latency high
Details: p95 latency >2.1s (threshold: 2s)
Runbook:
1. Check recent deployments (rollback if within 1 hour)
Dashboard: https://deploy.company.com
2. Check database connection pool
Grafana: https://grafana.company.com/db-pool
If pool saturation >80%, restart API service:
$ kubectl rollout restart deployment/api-service
3. Check for slow queries
$ kubectl logs -l app=api --tail=100 | grep "duration"
If slow query found, check database load in Datadog
4. If none of the above: Escalate to Tier 2
Common causes:
- Database connection pool exhaustion (70% of incidents)
- Upstream dependency timeout (20% of incidents)
- Memory leak after deployment (10% of incidents)
This runbook tells the engineer exactly what to check, how to check it, and when to escalate. They don’t need to be an expert—they just follow the steps.
Runbook checklist:
- Link to relevant dashboard/logs
- Step-by-step troubleshooting
- Common causes (with percentages if possible)
- Commands to run (copy-pasteable)
- Clear escalation criteria
- Context on what the alert means
If an alert doesn’t have a runbook, it shouldn’t be paging people at night.
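You can enforce that rule mechanically. Here's a minimal sketch of a CI check, assuming Prometheus-style rule files where paging alerts carry a severity: page label and runbooks live in a runbook_url annotation; adjust the paths and field names to whatever your stack uses.
import glob
import sys
import yaml  # pip install pyyaml
def missing_runbooks(pattern: str = "alerts/*.yml") -> list[str]:
    """Return the names of paging alerts that have no runbook_url annotation."""
    missing = []
    for path in glob.glob(pattern):
        with open(path) as f:
            doc = yaml.safe_load(f) or {}
        for group in doc.get("groups", []):
            for rule in group.get("rules", []):
                if "alert" not in rule:
                    continue  # recording rule, not an alert
                is_paging = rule.get("labels", {}).get("severity") == "page"
                has_runbook = rule.get("annotations", {}).get("runbook_url")
                if is_paging and not has_runbook:
                    missing.append(rule["alert"])
    return missing
if __name__ == "__main__":
    offenders = missing_runbooks()
    if offenders:
        print("Paging alerts without runbooks:", ", ".join(offenders))
        sys.exit(1)  # fail the build until every paging alert has a runbook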
Reducing Alert Fatigue
Alert fatigue is the fastest way to ruin on-call. When engineers get 30 alerts per shift and 25 of them are false positives, they start ignoring alerts entirely—including the real ones.
Symptoms of alert fatigue:
- Alerts are acknowledged without being investigated
- Engineers mute alert channels
- Pages are ignored during off-hours
- Team jokes about “alert spam”
The fix: ruthlessly prune alerts.
Every alert should answer:
- Is this actionable? Can the on-call engineer do something right now to fix it?
- Is this urgent? Does it need attention within 15 minutes or can it wait until morning?
- Is this accurate? Are false positives <5%?
If the answer to any of these is “no,” don’t page for it.
Bad alerts to eliminate:
# Alert: Disk usage >70%
# Problem: 70% isn't an emergency. Page at 90%.
# Alert: Single server down (in cluster of 20)
# Problem: Redundancy handles this. Page if >3 servers down.
# Alert: Deployment started
# Problem: Not actionable. Deployments are normal.
# Alert: Warning logged
# Problem: Warnings aren't emergencies. Track separately.
Good alerts:
# Alert: API error rate >5%
# Why: Users are affected right now, immediate action needed
# Alert: Database replication lag >60s
# Why: Risk of data loss, needs investigation
# Alert: Payment processing queue backed up >1000 jobs
# Why: Customers can't complete orders, revenue impact
# Alert: SSL certificate expires in <7 days
# Why: Site will break if not renewed, but gives time to fix during business hours
Most teams can cut their alert volume by 50-70% by asking “is this actually urgent?”
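Finding the pruning candidates is easiest with a month of paging history in front of you. A minimal sketch, assuming you can export pages to a CSV with alert_name and was_actionable columns (the column names are hypothetical; most paging tools can produce something equivalent):
import csv
from collections import Counter
def audit_pages(path: str = "pages_last_30_days.csv") -> None:
    """Print each alert's page volume and how often it actually required action."""
    total, actionable = Counter(), Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            name = row["alert_name"]
            total[name] += 1
            if row["was_actionable"].strip().lower() == "true":
                actionable[name] += 1
    for name, count in total.most_common():
        rate = actionable[name] / count
        flag = "  <-- prune, demote, or fix the threshold" if rate < 0.5 else ""
        print(f"{name}: {count} pages, {rate:.0%} actionable{flag}")
audit_pages()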
Compensation and Fairness
Engineers need to feel that on-call is fairly distributed and compensated.
Fair distribution means:
- Everyone at the same level participates equally (no special exemptions)
- Senior engineers take escalation shifts (not all primary shifts)
- New hires ramp up gradually (shadow for 2-4 weeks before solo shifts)
- Shift swaps are easy and encouraged
Fair compensation means:
- Extra pay for on-call shifts (even if no pages happen)
- Additional comp time for after-hours pages
- Automatic comp time, not “ask your manager”
- Incidents during off-hours = no expectation of regular work the next day
One pattern that works well:
On-call pay:
Base stipend: $500/week (even if zero pages)
Per page (off-hours): 4 hours comp time
Major incident (>2 hours): Additional day off
If you got paged Saturday night at 2 AM:
- Don't work Monday (use comp time)
- Still get base stipend
- No questions asked
This removes the incentive to pretend pages didn’t happen or to “tough it out” and work the next day exhausted.
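A policy this simple can be computed from paging history instead of relying on people to claim it. A minimal sketch, with hypothetical Page fields and the numbers taken from the policy above:
from dataclasses import dataclass
from datetime import datetime
@dataclass
class Page:
    fired_at: datetime     # when the page went out
    incident_hours: float  # how long the incident ran
def comp_hours(page: Page) -> float:
    """Comp time owed for a single page, per the policy above."""
    if page.fired_at.weekday() >= 5:                          # Saturday or Sunday
        hours = 8.0
    elif page.fired_at.hour >= 22 or page.fired_at.hour < 8:  # 10 PM - 8 AM
        hours = 4.0
    else:
        hours = 0.0
    if page.incident_hours > 2:
        # Major incidents get additional comp time at manager discretion;
        # flag them rather than guessing an amount.
        print(f"major incident at {page.fired_at:%Y-%m-%d %H:%M}: add discretionary comp time")
    return hours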
Post-Incident Reviews Without Blame
Every major incident should have a post-incident review (PIR). The goal isn’t to blame the on-call engineer—it’s to improve the system.
Good PIRs focus on:
- Why did this alert fire? Was the threshold right?
- Did the runbook help? What was missing?
- Could we have prevented this? (Without saying “engineer should have known X”)
- What can we automate? Self-healing systems reduce pages
Bad PIRs focus on:
- “The engineer should have done X faster”
- “Why didn’t the engineer know about Y?”
- Blame or shaming language
The on-call engineer should be able to write the PIR themselves without fear. If the culture is “incidents mean someone screwed up,” engineers will hide problems instead of fixing them.
Gradual Improvement > Perfection
You won’t fix on-call overnight. The path looks like:
Month 1:
- Document current alert volume and response times
- Create runbooks for top 5 most common alerts
- Set clear response time expectations
Month 2:
- Eliminate or lower severity for non-actionable alerts
- Implement tiered rotation if not already in place
- Ensure comp time policy is clear and automatic
Month 3:
- Review PIRs from month 2, identify patterns
- Automate fixes for top 3 most common issues
- Update runbooks based on real incidents
Ongoing:
- Quarterly review of alert volume and accuracy
- Regular runbook updates
- Continuous automation of repetitive fixes
The key metric: how many alerts per week per engineer?
- 0-5 alerts/week: Healthy
- 6-10 alerts/week: Monitor closely
- 11-20 alerts/week: Needs improvement
- More than 20 alerts/week: Unsustainable, fix immediately
Track this over time. If the trend is upward, address it before engineers burn out.
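The thresholds are easy to bake into a weekly report. A minimal sketch that buckets a per-engineer alert count using the ranges above:
def oncall_health(alerts_per_week: int) -> str:
    """Classify per-engineer weekly alert volume using the thresholds above."""
    if alerts_per_week <= 5:
        return "healthy"
    if alerts_per_week <= 10:
        return "monitor closely"
    if alerts_per_week <= 20:
        return "needs improvement"
    return "unsustainable, fix immediately"
for count in (3, 8, 15, 25):
    print(count, "->", oncall_health(count))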
What We Actually Do
For most clients, we implement:
- One-week primary rotations with two-week escalation rotations
- Tiered on-call (primary, escalation, management)
- Runbooks for every alert that can page someone off-hours
- Aggressive alert pruning (remove 50-70% of low-value alerts)
- Comp time policies that are automatic and generous
- Follow-the-sun handoffs when team distribution allows
The teams with the healthiest on-call systems treat it as a first-class engineering problem, not an afterthought. They budget time to write runbooks, automate incident responses, and continuously improve the system.
On-call will never be fun. But it can be sustainable, fair, and boring—which is exactly what you want.
Need help designing an on-call rotation that works for your team, or reducing alert fatigue? We can help.