It’s 3 AM. Your phone vibrates. PagerDuty says the site is down. You roll out of bed, open your laptop, and confirm: the deployment from six hours ago broke something that only manifests under production load.
You need to roll back. Fast.
This is the reality of on-call rotations—when incidents happen, your rollback strategy either saves you or compounds the problem. It’s also when you learn whether that strategy actually works or just looks good in architecture diagrams. We’ve been on both sides—helping teams build systems that roll back cleanly, and fixing systems where rolling back is scarier than moving forward.
Here’s what matters when you need to restore service while half-asleep.
The First Rule: Rollback Must Be Faster Than Fixing Forward
When something breaks in production, you have two options: fix forward (deploy a patch) or roll back (revert to the previous version).
If rolling back takes longer than deploying a fix, you’ll fix forward every time. The rollback strategy becomes theoretical.
What makes rollback slow:
- Manual coordination across multiple systems
- Database migrations that can’t easily reverse
- Configuration that drifted from the deployment
- Unclear “last known good” version
- Approval processes designed for normal deployments, not emergencies
What makes rollback fast:
- One command or button press
- Automated, tested rollback procedure
- Database migrations designed to be reversible
- Clear version pinning and artifact retention
- Emergency access patterns that bypass normal approvals
Speed is the entire point. If your rollback process has five steps and requires two people, it won’t get used when it matters.
Strategy 1: Blue-Green Deployment (Keep the Old Version Running)
Blue-green deployments keep two complete environments: one serving traffic (green), one idle (blue). Deploy the new version to blue, test it, switch traffic over. If it breaks, switch back.
Why it works at 3 AM:
Rolling back is just flipping a load balancer or router config. The old version never stopped running. You’re not redeploying—you’re redirecting.
Implementation patterns:
- Infrastructure level: Two sets of servers behind a load balancer. Change the target group.
- Kubernetes level: Two deployments with different labels. Change the service selector.
- DNS level: Two environments with different DNS records. Change the CNAME (slower due to TTL).
Example: Kubernetes blue-green with service selector
```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
    version: green  # switch this to 'blue' to roll back
  ports:
    - port: 80
```

Change `version: green` to `version: blue` and reapply. Traffic shifts immediately.
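During an incident you may prefer a patch over editing the manifest by hand. A sketch of what that patch file might look like (the file name is an assumption; `kubectl patch --patch-file` requires a reasonably recent kubectl):

```yaml
# service-selector-patch.yaml -- flips live traffic from green to blue
# apply with: kubectl patch service web --patch-file service-selector-patch.yaml
spec:
  selector:
    version: blue
```

Because strategic merge patches merge map keys, only the `version` key needs to appear; the `app: web` selector entry is preserved.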
Tradeoffs:
- Doubles infrastructure cost (two environments running simultaneously)
- Only works if the old environment can handle current load
- Database schema changes complicate blue-green (see below)
When it’s worth it:
Blue-green is expensive but reliable. If downtime costs more than infrastructure, pay for redundancy. SaaS products with SLAs, e-commerce during peak seasons, financial services—these justify the cost.
Strategy 2: Rolling Deployment with Fast Rollback
Rolling deployments update instances gradually—replace 10%, test, replace another 10%, etc. If something breaks, roll back by redeploying the old version using the same process in reverse.
Why it works at 3 AM:
Rolling deployments are standard in Kubernetes and most orchestrators. Rollback is a single command that triggers the same deployment mechanism.
Kubernetes rollback:
```shell
kubectl rollout undo deployment/web
```
This reverts to the previous ReplicaSet, which Kubernetes keeps around automatically.
Example: GitHub Actions deployment with rollback
```yaml
- name: Rollback on failure
  if: failure()
  run: |
    kubectl rollout undo deployment/web -n production
    kubectl rollout status deployment/web -n production
```
If the deployment job fails, rollback happens automatically.
Tradeoffs:
- Rollback is a redeployment, not instant
- If the infrastructure or cluster is unhealthy, rollback may also fail
- Requires clear version tagging and artifact retention
When it’s worth it:
Rolling deployments are the default for good reason—they balance risk and resource efficiency. For most applications, this is the right baseline.
Strategy 3: Feature Flags (Don’t Roll Back the Code, Roll Back the Behavior)
Feature flags decouple deployment from release. Deploy code with new features disabled. Enable the feature for a percentage of users. If it breaks, disable the flag.
Why it works at 3 AM:
Rolling back a feature flag is toggling a boolean in a config system or admin UI. No redeployment required. The code stays deployed; the behavior changes.
Example: LaunchDarkly-style flag rollback
```python
# ld_client = ldclient.get(), obtained after initializing the LaunchDarkly SDK
if ld_client.variation("new-checkout-flow", user, False):
    return new_checkout()
else:
    return old_checkout()
```

If `new-checkout-flow` causes issues, set it to False in the LaunchDarkly dashboard. Instantly, all users see the old checkout flow. For more on when feature flags help and when they hurt, see our feature flags guide.
Tradeoffs:
- Adds complexity to the codebase (conditional logic everywhere)
- Feature flags need lifecycle management (remove old flags)
- Flag system itself becomes a dependency and potential failure point
When it’s worth it:
Feature flags shine for risky changes or gradual rollouts. If you’re redesigning core workflows, rearchitecting APIs, or changing payment flows, flags let you roll back without redeploying.
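Gradual rollouts behind a flag typically rely on deterministic user bucketing, so the same user always lands in the same cohort and dialing the percentage down instantly removes everyone above the threshold. A minimal sketch—this is not LaunchDarkly’s actual algorithm; the function name and hashing scheme are assumptions:

```python
import hashlib

def in_rollout(user_id: str, flag_key: str, percent: int) -> bool:
    """Deterministically place user_id into a 0-99 bucket for this flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket per (flag, user) pair
    return bucket < percent
```

Setting the percentage to 0 is the rollback: every user falls out of the cohort on their next request, with no redeploy.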
Strategy 4: Database-Compatible Deployments
Most rollback strategies fail because of database migrations. You can roll back application code easily, but if the new version wrote to database columns the old version doesn’t understand, rolling back the app just breaks things in a different way.
The expand-contract pattern:
Instead of modifying database schema in lockstep with code, expand first, contract later.
Example: Renaming a column
Bad approach (breaks rollback):

1. Deploy code that reads `new_column`
2. Migrate DB: `ALTER TABLE users RENAME COLUMN old_column TO new_column`
3. If rollback: old code looks for `old_column`, which no longer exists. Broken.

Good approach (rollback-safe):

1. Migrate DB: `ALTER TABLE users ADD COLUMN new_column`
2. Deploy code that writes to both `old_column` and `new_column`, reads from `new_column`
3. Backfill `new_column` with data from `old_column`
4. Deploy code that only uses `new_column`
5. Later, after confidence: `ALTER TABLE users DROP COLUMN old_column`

If you roll back at any point before step 5, the old code still finds `old_column` and works.
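The expand phase is easy to see in miniature. A sketch using an in-memory SQLite database (column and table names follow the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, old_column TEXT)")
conn.execute("INSERT INTO users (old_column) VALUES ('alice')")

# Step 1 (expand): add the new column -- old code is unaffected
conn.execute("ALTER TABLE users ADD COLUMN new_column TEXT")

# Step 2 (dual-write): new code writes to both columns
conn.execute("INSERT INTO users (old_column, new_column) VALUES ('bob', 'bob')")

# Step 3 (backfill): copy existing rows into the new column
conn.execute("UPDATE users SET new_column = old_column WHERE new_column IS NULL")

# Old code reading old_column still works at every step
rows = conn.execute("SELECT old_column, new_column FROM users ORDER BY id").fetchall()
```

At every point in this sequence, a version of the app that only knows about `old_column` keeps functioning—which is exactly what makes rollback safe.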
Why it works at 3 AM:
Database changes don’t break rollback. The old version of the app still functions even after migrations run. This pattern is essential for zero-downtime deployments—see our database migration strategies guide for more details.
Tradeoffs:
- Multi-step migrations are slower and more complex
- Temporary duplication of data and logic
When it’s worth it:
Always. Database compatibility should be standard practice, not an edge case. It’s the difference between “rollback works” and “rollback makes things worse.”
Strategy 5: Immutable Infrastructure (Redeploy from Artifact)
Immutable infrastructure means never modifying running systems. Instead, build a new image or AMI with the desired version and replace instances.
Rollback is deploying an image from an earlier build.
Example: Terraform with versioned AMIs
```hcl
resource "aws_launch_template" "web" {
  image_id = "ami-12345678"  # previous version
}
```
Change the AMI ID to an earlier one, apply Terraform, and the auto-scaling group replaces instances.
Why it works at 3 AM:
Artifacts are immutable and versioned. You’re not wondering “what was running before?”—you have the exact AMI, Docker tag, or binary.
Tradeoffs:
- Requires disciplined artifact management and retention policies
- Slower than in-place rollbacks (new instances must start)
When it’s worth it:
Immutable infrastructure pairs well with blue-green or canary deployments. If you’re already treating servers as cattle, rollback via redeployment is natural.
What Actually Happens at 3 AM
Theory is clean. Practice is messy. Here’s what rollback failures look like in the real world:
“We didn’t keep the old Docker image.”
The registry only keeps the last five tags. The version from a week ago was purged. Rollback means rebuilding from git—if the build still works.
Solution: Retention policies that keep at least 30 days of production images.
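In Amazon ECR, for instance, a lifecycle policy along these lines expires images only after 30 days (the numbers and the blanket “expire everything older” rule are assumptions—you’d likely keep tagged release images longer):

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire images more than 30 days old",
      "selection": {
        "tagStatus": "any",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 30
      },
      "action": { "type": "expire" }
    }
  ]
}
```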
“The database migration can’t reverse.”
A migration deleted a column. Rolling back the code doesn’t undelete the column.
Solution: Additive migrations only. Never drop columns in the same release as code changes.
“We don’t remember which version was good.”
Five deployments happened today. Which one do we roll back to?
Solution: Tag deployments with timestamps and commit SHAs. Link deployments to monitoring dashboards.
“The rollback broke differently.”
The old version has a bug that only triggers with the current data shape.
Solution: Automated rollback testing. Periodically test that the previous version can deploy and run against current data.
“We need approval to deploy to production.”
The rollback command exists, but it’s technically a deployment, which requires manager approval and a change ticket.
Solution: Emergency access procedures. On-call engineers should have rollback authority without approvals.
Building Rollback Into Your Process
Rollback isn’t a feature you add at the end. It’s a design principle.
In CI/CD pipelines:
- Tag every production deployment with version, commit SHA, timestamp
- Keep artifacts (Docker images, binaries, AMIs) for at least 30 days
- Test rollback in staging before deploying to production
In database migrations:
- Never drop columns or tables in the same release as code changes
- Use expand-contract for schema changes
- Make migrations reversible or additive-only
In monitoring:
- Link deployments to metrics dashboards
- Automatic alerts when error rates spike post-deployment
- Clear “last known good” version visible in dashboards
- Proper observability to catch issues fast (see our observability guide)
In access control:
- On-call engineers can execute rollbacks without approvals
- Rollback commands are audited but not blocked
- Runbooks include one-line rollback commands
Testing Your Rollback
If you haven’t tested rollback, you don’t have rollback.
Quarterly drill:
- Deploy a new version to production
- Immediately roll back to the previous version
- Verify the rollback worked and the application functions correctly
- Document what failed or was confusing
This sounds excessive, but the first time you discover your rollback doesn’t work shouldn’t be during an outage.
Automated rollback testing:
Some teams run automated tests that deploy version N, then roll back to version N-1, and verify functionality. This catches incompatibilities early.
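In GitHub Actions, such a drill might look like this sketch (the schedule, namespace, and smoke-test script are assumptions):

```yaml
name: rollback-drill
on:
  schedule:
    - cron: "0 4 * * 1"  # weekly, Monday 04:00 UTC
jobs:
  drill:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy current version to staging
        run: kubectl apply -f k8s/ -n staging
      - name: Roll back to the previous version
        run: kubectl rollout undo deployment/web -n staging
      - name: Verify the rolled-back deployment is healthy
        run: |
          kubectl rollout status deployment/web -n staging --timeout=120s
          ./scripts/smoke-test.sh staging
```

Running this against staging on a schedule means a broken rollback path shows up in a failed CI run, not in an incident channel.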
The Bottom Line
Rollback strategies that work at 3 AM have three properties:
- Fast: One command or button, not a procedure
- Tested: Regularly exercised, not theoretical
- Safe: Database-compatible, artifact-retained, well-understood
Blue-green is fast but expensive. Feature flags decouple code from behavior. Rolling deployments balance cost and safety. Immutable infrastructure enforces discipline.
Pick the strategy that fits your risk tolerance and operational complexity. Then test it when you’re awake, so it works when you’re not.
The goal isn’t perfection. The goal is getting back to a working state faster than the incident spirals. Rollback is a parachute—you hope you never need it, but you pack it carefully just in case.
Dealing with frequent production incidents or unclear rollback procedures? We help teams design deployment pipelines that actually work under pressure—blue-green deployments, automated rollback strategies, and database migration patterns that don’t break in production. Learn how we transformed one team’s deployment process to achieve 95% fewer deployment failures, or explore our cloud platform engineering services to build resilient infrastructure.