On-call is where a lot of engineering teams quietly fall apart. You implement it with good intentions—“we need someone available if the site goes down”—and six months later your best engineers are interviewing elsewhere because they got woken up at 2 AM three times in one week for issues that could have waited until morning.
I’ve helped multiple companies redesign their on-call systems after they reached a crisis point: senior engineers refusing to participate, alerts being ignored, or worse, people leaving the company entirely. The problem is almost never that engineers don’t want responsibility—it’s that the on-call system sets them up to fail.
Here’s how to structure on-call rotations that actually work.
The Core Principle: On-Call Should Be Boring
If your on-call rotation is exciting, something is wrong. Good on-call shifts are uneventful—alerts are rare, when they happen they’re actionable, and engineers have the tools and runbooks to resolve issues quickly.
Bad on-call rotations are characterized by:
- Alert fatigue: 20+ pages per shift, mostly false alarms
- Mystery alerts: “Service X is down” with no context or runbook
- Unfixable issues: Alerts for problems that require architectural changes
- Unclear escalation: Engineers don’t know when to wake up their manager
- No time to fix root causes: Teams spend all their time firefighting
If your on-call engineers are constantly stressed, the problem isn’t the people—it’s the system.
Rotation Structure That Actually Works
Single-tier on-call doesn’t scale. One person being responsible for everything leads to burnout. You need tiers:
Tier 1 (Primary on-call): Receives all alerts, handles known issues, escalates unknowns
Tier 2 (Escalation): Senior engineers who can debug complex problems
Tier 3 (Management/leadership): Gets involved for major outages or business-critical decisions
Most teams stop at a single tier and wonder why engineers burn out. The primary on-call engineer should be able to resolve 80% of issues without waking anyone else; the remaining 20% should follow a clear escalation path.
# Example on-call schedule structure
Primary Rotation (Tier 1):
- Duration: 1 week
- Team: All engineers (rotates)
- Responsibility: First responder for all alerts
- SLA: Acknowledge within 15 minutes
Escalation (Tier 2):
- Duration: 1 week (offset from primary)
- Team: Senior engineers
- Responsibility: Handle escalations, assist with complex issues
- SLA: Respond within 30 minutes of escalation
On-call Manager (Tier 3):
- Duration: 1 month
- Team: Engineering managers
- Responsibility: Major incidents, customer escalations
- SLA: Respond to Sev-1 incidents
This gives the primary on-call engineer a clear escalation path, reduces pressure, and ensures they’re not alone.
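The escalation logic itself is small enough to encode directly in tooling. Here's a minimal sketch in Python, purely illustrative: the field names are hypothetical, and the 15- and 30-minute thresholds mirror the acknowledgement SLA above and the escalation criteria later in this post.
from dataclasses import dataclass
@dataclass
class PageState:
    severity: int        # 1 = Sev-1 (major outage), 2+ = lower severity
    age_minutes: float   # minutes since the alert fired
    acknowledged: bool   # primary has acknowledged the page
    resolved: bool
def responders(page: PageState) -> list[str]:
    """Who should be engaged right now, per the SLAs in the schedule above."""
    if page.resolved:
        return []
    engaged = ["tier1-primary"]
    # Missed the 15-minute acknowledgement SLA, or still unresolved after 30 minutes.
    if (not page.acknowledged and page.age_minutes > 15) or page.age_minutes > 30:
        engaged.append("tier2-escalation")
    if page.severity == 1:
        engaged.append("tier3-oncall-manager")  # Sev-1 incidents pull in the on-call manager
    return engaged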
Rotation Length: Not Too Short, Not Too Long
One-day rotations are chaos. Engineers spend the whole day anxious, then hand off without building context. Issues that span multiple days require constant handoffs.
Month-long rotations lead to burnout. Four weeks of interrupted sleep and weekend alerts will break anyone.
One-week rotations are the sweet spot. Engineers can build context, most issues resolve within the week, and people get 3-4 weeks of uninterrupted time between shifts.
Some teams do two-week rotations, which also works. The key is balancing context-building time with recovery time.
Follow-the-sun rotations work well for distributed teams:
# Example follow-the-sun schedule
EMEA shift (9 AM - 5 PM UTC):
Primary: Engineer in London or Berlin
US shift (9 AM - 5 PM PST / 5 PM - 1 AM UTC):
Primary: Engineer in SF or NYC
APAC shift (9 AM - 5 PM SGT / 1 AM - 9 AM UTC):
Primary: Engineer in Singapore or Sydney
This eliminates most after-hours pages. The catch: you need engineers in all time zones and tight handoff processes.
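The handoff logic itself is just a lookup by UTC hour. A minimal sketch in Python, assuming shifts are keyed purely off UTC (which glosses over daylight-saving changes; the real schedule belongs in your paging tool):
from datetime import datetime, timezone
from typing import Optional
# UTC hour ranges lifted from the example schedule above.
SHIFTS = [
    (9, 17, "EMEA"),   # 9 AM - 5 PM UTC
    (17, 24, "US"),    # 5 PM - midnight UTC
    (0, 1, "US"),      # midnight - 1 AM UTC (the US shift wraps past midnight)
    (1, 9, "APAC"),    # 1 AM - 9 AM UTC
]
def primary_region(now: Optional[datetime] = None) -> str:
    """Return which regional rotation holds primary on-call right now."""
    hour = (now or datetime.now(timezone.utc)).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise RuntimeError("no shift covers this hour")  # unreachable with full 24h coverage
print(primary_region())  # e.g. "EMEA" at 10:00 UTC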
Setting Clear Expectations
On-call stress often comes from unclear expectations. Engineers don’t know:
- How fast they need to respond
- What issues justify waking up the team
- When to escalate vs. keep trying to fix it themselves
- Whether they’re expected to work weekends
- If they’ll get comp time for after-hours pages
Document this explicitly:
# On-Call Expectations
## Response Times
- Critical alerts (production down): 15 minutes
- High-priority alerts (degraded service): 30 minutes
- Medium-priority alerts (non-customer-facing): 1 hour
- Low-priority alerts: Next business day
## Escalation Criteria
Escalate to Tier 2 if:
- Issue isn't resolved within 30 minutes
- Issue requires database changes in production
- Multiple services are affected
- Customer data may be at risk
Escalate to management if:
- Outage affects >50% of users
- Potential data breach or security incident
- Resolution will take >2 hours
## Compensation
- Pages between 10 PM and 8 AM: +4 hours comp time
- Pages on weekends: +8 hours comp time
- Major incidents (>2 hours): Additional comp time at manager discretion
Comp time must be used within 30 days.
This removes ambiguity. Engineers know exactly what’s expected and when they can go back to sleep without guilt.
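The response-time matrix is also worth encoding as data, so paging rules and reports read the same policy the humans do. A minimal sketch in Python: the severity names and thresholds mirror the doc above, the routing strings are hypothetical, and whether each severity pages at all is my reading of the response times.
# Response-time matrix from the expectations doc, encoded as data.
RESPONSE_POLICY = {
    "critical": {"respond_within_minutes": 15, "pages": True},    # production down
    "high":     {"respond_within_minutes": 30, "pages": True},    # degraded service
    "medium":   {"respond_within_minutes": 60, "pages": True},    # non-customer-facing
    "low":      {"respond_within_minutes": None, "pages": False}, # next business day
}
def route(severity: str) -> str:
    """Page for anything with a paging SLA; file a ticket for everything else."""
    return "page-tier1-primary" if RESPONSE_POLICY[severity]["pages"] else "create-ticket"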
Runbooks: The Difference Between Fixable and Awful
The worst on-call pages are the ones with no runbook:
Alert: API latency high
Details: p95 latency >2s
Runbook: (none)
The engineer has to:
- Figure out what service is involved
- Find the logs
- Guess at what’s causing high latency
- Try random fixes
- Hope something works
This turns a 10-minute fix into a 90-minute investigation at 3 AM.
A good alert includes a runbook:
Alert: API latency high
Details: p95 latency >2.1s (threshold: 2s)
Runbook:
1. Check recent deployments (rollback if within 1 hour)
Dashboard: https://deploy.company.com
2. Check database connection pool
Grafana: https://grafana.company.com/db-pool
If pool saturation >80%, restart API service:
$ kubectl rollout restart deployment/api-service
3. Check for slow queries
$ kubectl logs -l app=api --tail=100 | grep "duration"
If slow query found, check database load in Datadog
4. If none of the above: Escalate to Tier 2
Common causes:
- Database connection pool exhaustion (70% of incidents)
- Upstream dependency timeout (20% of incidents)
- Memory leak after deployment (10% of incidents)
This runbook tells the engineer exactly what to check, how to check it, and when to escalate. They don’t need to be an expert—they just follow the steps.
Runbook checklist:
- Link to relevant dashboard/logs
- Step-by-step troubleshooting
- Common causes (with percentages if possible)
- Commands to run (copy-pasteable)
- Clear escalation criteria
- Context on what the alert means
If an alert doesn’t have a runbook, it shouldn’t be paging people at night.
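You can enforce that rule mechanically. Here's a minimal sketch of a CI check, assuming Prometheus-style rule files where paging alerts carry a severity: page label and runbooks live in a runbook_url annotation; adjust the paths and field names to whatever your stack uses.
import glob
import sys
import yaml  # pip install pyyaml
def missing_runbooks(pattern: str = "alerts/*.yml") -> list[str]:
    """Return the names of paging alerts that have no runbook_url annotation."""
    missing = []
    for path in glob.glob(pattern):
        with open(path) as f:
            doc = yaml.safe_load(f) or {}
        for group in doc.get("groups", []):
            for rule in group.get("rules", []):
                if "alert" not in rule:
                    continue  # recording rule, not an alert
                is_paging = rule.get("labels", {}).get("severity") == "page"
                has_runbook = rule.get("annotations", {}).get("runbook_url")
                if is_paging and not has_runbook:
                    missing.append(rule["alert"])
    return missing
if __name__ == "__main__":
    offenders = missing_runbooks()
    if offenders:
        print("Paging alerts without runbooks:", ", ".join(offenders))
        sys.exit(1)  # fail the build until every paging alert has a runbook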
Reducing Alert Fatigue
Alert fatigue is the fastest way to ruin on-call. When engineers get 30 alerts per shift and 25 of them are false positives, they start ignoring alerts entirely—including the real ones.
Symptoms of alert fatigue:
- Alerts are acknowledged without being investigated
- Engineers mute alert channels
- Pages are ignored during off-hours
- Team jokes about “alert spam”
The fix: ruthlessly prune alerts.
Every alert should answer:
- Is this actionable? Can the on-call engineer do something right now to fix it?
- Is this urgent? Does it need attention within 15 minutes or can it wait until morning?
- Is this accurate? Are false positives <5%?
If the answer to any of these is “no,” don’t page for it.
Bad alerts to eliminate:
# Alert: Disk usage >70%
# Problem: 70% isn't an emergency. Page at 90%.
# Alert: Single server down (in cluster of 20)
# Problem: Redundancy handles this. Page if >3 servers down.
# Alert: Deployment started
# Problem: Not actionable. Deployments are normal.
# Alert: Warning logged
# Problem: Warnings aren't emergencies. Track separately.
Good alerts:
# Alert: API error rate >5%
# Why: Users are affected right now, immediate action needed
# Alert: Database replication lag >60s
# Why: Risk of data loss, needs investigation
# Alert: Payment processing queue backed up >1000 jobs
# Why: Customers can't complete orders, revenue impact
# Alert: SSL certificate expires in <7 days
# Why: Site will break if not renewed, but gives time to fix during business hours
Most teams can cut their alert volume by 50-70% by asking “is this actually urgent?”
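Finding the pruning candidates is easiest with a month of paging history in front of you. A minimal sketch, assuming you can export pages to a CSV with alert_name and was_actionable columns (the column names are hypothetical; most paging tools can produce something equivalent):
import csv
from collections import Counter
def audit_pages(path: str = "pages_last_30_days.csv") -> None:
    """Print each alert's page volume and how often it actually required action."""
    total, actionable = Counter(), Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            name = row["alert_name"]
            total[name] += 1
            if row["was_actionable"].strip().lower() == "true":
                actionable[name] += 1
    for name, count in total.most_common():
        rate = actionable[name] / count
        flag = "  <-- prune, demote, or fix the threshold" if rate < 0.5 else ""
        print(f"{name}: {count} pages, {rate:.0%} actionable{flag}")
audit_pages()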
Compensation and Fairness
Engineers need to feel that on-call is fairly distributed and compensated.
Fair distribution means:
- Everyone at the same level participates equally (no special exemptions)
- Senior engineers take escalation shifts (not all primary shifts)
- New hires ramp up gradually (shadow for 2-4 weeks before solo shifts)
- Shift swaps are easy and encouraged
Fair compensation means:
- Extra pay for on-call shifts (even if no pages happen)
- Additional comp time for after-hours pages
- Automatic comp time, not “ask your manager”
- Incidents during off-hours = no expectation of regular work the next day
One pattern that works well:
On-call pay:
Base stipend: $500/week (even if zero pages)
Per page (off-hours): 4 hours comp time
Major incident (>2 hours): Additional day off
If you got paged Saturday night at 2 AM:
- Don't work Monday (use comp time)
- Still get base stipend
- No questions asked
This removes the incentive to pretend pages didn’t happen or to “tough it out” and work the next day exhausted.
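A policy this simple can be computed from paging history instead of relying on people to claim it. A minimal sketch, with hypothetical Page fields and the numbers taken from the policy above:
from dataclasses import dataclass
from datetime import datetime
@dataclass
class Page:
    fired_at: datetime     # when the page went out
    incident_hours: float  # how long the incident ran
def comp_hours(page: Page) -> float:
    """Comp time owed for a single page, per the policy above."""
    if page.fired_at.weekday() >= 5:                          # Saturday or Sunday
        hours = 8.0
    elif page.fired_at.hour >= 22 or page.fired_at.hour < 8:  # 10 PM - 8 AM
        hours = 4.0
    else:
        hours = 0.0
    if page.incident_hours > 2:
        # Major incidents get additional comp time at manager discretion;
        # flag them rather than guessing an amount.
        print(f"major incident at {page.fired_at:%Y-%m-%d %H:%M}: add discretionary comp time")
    return hours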
Post-Incident Reviews Without Blame
Every major incident should have a post-incident review (PIR). The goal isn’t to blame the on-call engineer—it’s to improve the system.
Good PIRs focus on:
- Why did this alert fire? Was the threshold right?
- Did the runbook help? What was missing?
- Could we have prevented this? (Without saying “engineer should have known X”)
- What can we automate? Self-healing systems reduce pages
Bad PIRs focus on:
- “The engineer should have done X faster”
- “Why didn’t the engineer know about Y?”
- Blame or shaming language
The on-call engineer should be able to write the PIR themselves without fear. If the culture is “incidents mean someone screwed up,” engineers will hide problems instead of fixing them.
Gradual Improvement > Perfection
You won’t fix on-call overnight. The path looks like:
Month 1:
- Document current alert volume and response times
- Create runbooks for top 5 most common alerts
- Set clear response time expectations
Month 2:
- Eliminate or lower severity for non-actionable alerts
- Implement tiered rotation if not already in place
- Ensure comp time policy is clear and automatic
Month 3:
- Review PIRs from month 2, identify patterns
- Automate fixes for top 3 most common issues
- Update runbooks based on real incidents
Ongoing:
- Quarterly review of alert volume and accuracy
- Regular runbook updates
- Continuous automation of repetitive fixes
The key metric: how many alerts per week per engineer?
- 0-5 alerts/week: Healthy
- 6-10 alerts/week: Monitor closely
- 11-20 alerts/week: Needs improvement
- More than 20 alerts/week: Unsustainable, fix immediately
Track this over time. If the trend is upward, address it before engineers burn out.
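The thresholds are easy to bake into a weekly report. A minimal sketch that buckets a per-engineer alert count using the ranges above:
def oncall_health(alerts_per_week: int) -> str:
    """Classify per-engineer weekly alert volume using the thresholds above."""
    if alerts_per_week <= 5:
        return "healthy"
    if alerts_per_week <= 10:
        return "monitor closely"
    if alerts_per_week <= 20:
        return "needs improvement"
    return "unsustainable, fix immediately"
for count in (3, 8, 15, 25):
    print(count, "->", oncall_health(count))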
What We Actually Do
For most clients, we implement:
- One-week primary rotations with two-week escalation rotations
- Tiered on-call (primary, escalation, management)
- Runbooks for every alert that can page someone off-hours
- Aggressive alert pruning (remove 50-70% of low-value alerts)
- Comp time policies that are automatic and generous
- Follow-the-sun handoffs when team distribution allows
The teams with the healthiest on-call systems treat it as a first-class engineering problem, not an afterthought. They budget time to write runbooks, automate incident responses, and continuously improve the system.
On-call will never be fun. But it can be sustainable, fair, and boring—which is exactly what you want.
Need help designing an on-call rotation that works for your team, or reducing alert fatigue? We can help.