It’s 3 AM. Your phone vibrates. PagerDuty says the site is down. You roll out of bed, open your laptop, and confirm: the deployment from six hours ago broke something that only manifests under production load.
You need to roll back. Fast.
This is the reality of on-call rotations—when incidents happen, your rollback strategy either saves you or compounds the problem. It’s also when you learn whether that strategy actually works or just looks good in architecture diagrams. We’ve been on both sides—helping teams build systems that roll back cleanly, and fixing systems where rolling back is scarier than moving forward.
Here’s what matters when you need to restore service while half-asleep.
The First Rule: Rollback Must Be Faster Than Fixing Forward
When something breaks in production, you have two options: fix forward (deploy a patch) or roll back (revert to the previous version).
If rolling back takes longer than deploying a fix, you’ll fix forward every time. The rollback strategy becomes theoretical.
What makes rollback slow:
- Manual coordination across multiple systems
- Database migrations that can’t easily reverse
- Configuration that drifted from the deployment
- Unclear “last known good” version
- Approval processes designed for normal deployments, not emergencies
What makes rollback fast:
- One command or button press
- Automated, tested rollback procedure
- Database migrations designed to be reversible
- Clear version pinning and artifact retention
- Emergency access patterns that bypass normal approvals
Speed is the entire point. If your rollback process has five steps and requires two people, it won’t get used when it matters.
Strategy 1: Blue-Green Deployment (Keep the Old Version Running)
Blue-green deployments keep two complete environments: one serving traffic (green), one idle (blue). Deploy the new version to blue, test it, switch traffic over. If it breaks, switch back.
Why it works at 3 AM:
Rolling back is just flipping a load balancer or router config. The old version never stopped running. You’re not redeploying—you’re redirecting.
Implementation patterns:
- Infrastructure level: Two sets of servers behind a load balancer. Change the target group.
- Kubernetes level: Two deployments with different labels. Change the service selector.
- DNS level: Two environments with different DNS records. Change the CNAME (slower due to TTL).
Example: Kubernetes blue-green with service selector
```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
    version: green  # switch this to 'blue' to roll back
  ports:
    - port: 80
```

Change `version: green` to `version: blue` and reapply. Traffic shifts immediately.
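During an incident you may prefer a patch over editing the manifest by hand. A sketch of what that patch file might look like (the file name is an assumption; `kubectl patch --patch-file` requires a reasonably recent kubectl):

```yaml
# service-selector-patch.yaml -- flips live traffic from green to blue
# apply with: kubectl patch service web --patch-file service-selector-patch.yaml
spec:
  selector:
    version: blue
```

Because strategic merge patches merge map keys, only the `version` key needs to appear; the `app: web` selector entry is preserved.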
Tradeoffs:
- Doubles infrastructure cost (two environments running simultaneously)
- Only works if the old environment can handle current load
- Database schema changes complicate blue-green (see below)
When it’s worth it:
Blue-green is expensive but reliable. If downtime costs more than infrastructure, pay for redundancy. SaaS products with SLAs, e-commerce during peak seasons, financial services—these justify the cost.
Strategy 2: Rolling Deployment with Fast Rollback
Rolling deployments update instances gradually—replace 10%, test, replace another 10%, etc. If something breaks, roll back by redeploying the old version using the same process in reverse.
Why it works at 3 AM:
Rolling deployments are standard in Kubernetes and most orchestrators. Rollback is a single command that triggers the same deployment mechanism.
Kubernetes rollback:
```shell
kubectl rollout undo deployment/web
```
This reverts to the previous ReplicaSet, which Kubernetes keeps around automatically.
Example: GitHub Actions deployment with rollback
```yaml
- name: Rollback on failure
  if: failure()
  run: |
    kubectl rollout undo deployment/web -n production
    kubectl rollout status deployment/web -n production
```
If the deployment job fails, rollback happens automatically.
Tradeoffs:
- Rollback is a redeployment, not instant
- If the infrastructure or cluster is unhealthy, rollback may also fail
- Requires clear version tagging and artifact retention
When it’s worth it:
Rolling deployments are the default for good reason—they balance risk and resource efficiency. For most applications, this is the right baseline.
Strategy 3: Feature Flags (Don’t Roll Back the Code, Roll Back the Behavior)
Feature flags decouple deployment from release. Deploy code with new features disabled. Enable the feature for a percentage of users. If it breaks, disable the flag.
Why it works at 3 AM:
Rolling back a feature flag is toggling a boolean in a config system or admin UI. No redeployment required. The code stays deployed; the behavior changes.
Example: LaunchDarkly-style flag rollback
```python
# ld_client = ldclient.get(), obtained after initializing the LaunchDarkly SDK
if ld_client.variation("new-checkout-flow", user, False):
    return new_checkout()
else:
    return old_checkout()
```

If `new-checkout-flow` causes issues, set it to False in the LaunchDarkly dashboard. Instantly, all users see the old checkout flow. For more on when feature flags help and when they hurt, see our feature flags guide.
Tradeoffs:
- Adds complexity to the codebase (conditional logic everywhere)
- Feature flags need lifecycle management (remove old flags)
- Flag system itself becomes a dependency and potential failure point
When it’s worth it:
Feature flags shine for risky changes or gradual rollouts. If you’re redesigning core workflows, rearchitecting APIs, or changing payment flows, flags let you roll back without redeploying.
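Gradual rollouts behind a flag typically rely on deterministic user bucketing, so the same user always lands in the same cohort and dialing the percentage down instantly removes everyone above the threshold. A minimal sketch—this is not LaunchDarkly’s actual algorithm; the function name and hashing scheme are assumptions:

```python
import hashlib

def in_rollout(user_id: str, flag_key: str, percent: int) -> bool:
    """Deterministically place user_id into a 0-99 bucket for this flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket per (flag, user) pair
    return bucket < percent
```

Setting the percentage to 0 is the rollback: every user falls out of the cohort on their next request, with no redeploy.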
Strategy 4: Database-Compatible Deployments
Most rollback strategies fail because of database migrations. You can roll back application code easily, but if the new version wrote to database columns the old version doesn’t understand, rolling back the app just breaks things in a different way.
The expand-contract pattern:
Instead of modifying database schema in lockstep with code, expand first, contract later.
Example: Renaming a column
Bad approach (breaks rollback):

1. Deploy code that reads `new_column`
2. Migrate DB: `ALTER TABLE users RENAME COLUMN old_column TO new_column`
3. If rollback: old code looks for `old_column`, which no longer exists. Broken.

Good approach (rollback-safe):

1. Migrate DB: `ALTER TABLE users ADD COLUMN new_column`
2. Deploy code that writes to both `old_column` and `new_column`, reads from `new_column`
3. Backfill `new_column` with data from `old_column`
4. Deploy code that only uses `new_column`
5. Later, after confidence: `ALTER TABLE users DROP COLUMN old_column`

If you roll back at any point before step 5, the old code still finds `old_column` and works.
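The expand phase is easy to see in miniature. A sketch using an in-memory SQLite database (column and table names follow the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, old_column TEXT)")
conn.execute("INSERT INTO users (old_column) VALUES ('alice')")

# Step 1 (expand): add the new column -- old code is unaffected
conn.execute("ALTER TABLE users ADD COLUMN new_column TEXT")

# Step 2 (dual-write): new code writes to both columns
conn.execute("INSERT INTO users (old_column, new_column) VALUES ('bob', 'bob')")

# Step 3 (backfill): copy existing rows into the new column
conn.execute("UPDATE users SET new_column = old_column WHERE new_column IS NULL")

# Old code reading old_column still works at every step
rows = conn.execute("SELECT old_column, new_column FROM users ORDER BY id").fetchall()
```

At every point in this sequence, a version of the app that only knows about `old_column` keeps functioning—which is exactly what makes rollback safe.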
Why it works at 3 AM:
Database changes don’t break rollback. The old version of the app still functions even after migrations run. This pattern is essential for zero-downtime deployments—see our database migration strategies guide for more details.
Tradeoffs:
- Multi-step migrations are slower and more complex
- Temporary duplication of data and logic
When it’s worth it:
Always. Database compatibility should be standard practice, not an edge case. It’s the difference between “rollback works” and “rollback makes things worse.”
Strategy 5: Immutable Infrastructure (Redeploy from Artifact)
Immutable infrastructure means never modifying running systems. Instead, build a new image or AMI with the desired version and replace instances.
Rollback is deploying an image from an earlier build.
Example: Terraform with versioned AMIs
```hcl
resource "aws_launch_template" "web" {
  image_id = "ami-12345678"  # previous version
}
```
Change the AMI ID to an earlier one, apply Terraform, and the auto-scaling group replaces instances.
Why it works at 3 AM:
Artifacts are immutable and versioned. You’re not wondering “what was running before?”—you have the exact AMI, Docker tag, or binary.
Tradeoffs:
- Requires disciplined artifact management and retention policies
- Slower than in-place rollbacks (new instances must start)
When it’s worth it:
Immutable infrastructure pairs well with blue-green or canary deployments. If you’re already treating servers as cattle, rollback via redeployment is natural.
What Actually Happens at 3 AM
Theory is clean. Practice is messy. Here’s what rollback failures look like in the real world:
“We didn’t keep the old Docker image.”
The registry only keeps the last five tags. The version from a week ago was purged. Rollback means rebuilding from git—if the build still works.
Solution: Retention policies that keep at least 30 days of production images.
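In Amazon ECR, for instance, a lifecycle policy along these lines expires images only after 30 days (the numbers and the blanket “expire everything older” rule are assumptions—you’d likely keep tagged release images longer):

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire images more than 30 days old",
      "selection": {
        "tagStatus": "any",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 30
      },
      "action": { "type": "expire" }
    }
  ]
}
```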
“The database migration can’t reverse.”
A migration deleted a column. Rolling back the code doesn’t undelete the column.
Solution: Additive migrations only. Never drop columns in the same release as code changes.
“We don’t remember which version was good.”
Five deployments happened today. Which one do we roll back to?
Solution: Tag deployments with timestamps and commit SHAs. Link deployments to monitoring dashboards.
“The rollback broke differently.”
The old version has a bug that only triggers with the current data shape.
Solution: Automated rollback testing. Periodically test that the previous version can deploy and run against current data.
“We need approval to deploy to production.”
The rollback command exists, but it’s technically a deployment, which requires manager approval and a change ticket.
Solution: Emergency access procedures. On-call engineers should have rollback authority without approvals.
Building Rollback Into Your Process
Rollback isn’t a feature you add at the end. It’s a design principle.
In CI/CD pipelines:
- Tag every production deployment with version, commit SHA, timestamp
- Keep artifacts (Docker images, binaries, AMIs) for at least 30 days
- Test rollback in staging before deploying to production
In database migrations:
- Never drop columns or tables in the same release as code changes
- Use expand-contract for schema changes
- Make migrations reversible or additive-only
In monitoring:
- Link deployments to metrics dashboards
- Automatic alerts when error rates spike post-deployment
- Clear “last known good” version visible in dashboards
- Proper observability to catch issues fast (see our observability guide)
In access control:
- On-call engineers can execute rollbacks without approvals
- Rollback commands are audited but not blocked
- Runbooks include one-line rollback commands
Testing Your Rollback
If you haven’t tested rollback, you don’t have rollback.
Quarterly drill:
- Deploy a new version to production
- Immediately roll back to the previous version
- Verify the rollback worked and the application functions correctly
- Document what failed or was confusing
This sounds excessive, but the first time you discover your rollback doesn’t work shouldn’t be during an outage.
Automated rollback testing:
Some teams run automated tests that deploy version N, then roll back to version N-1, and verify functionality. This catches incompatibilities early.
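In GitHub Actions, such a drill might look like this sketch (the schedule, namespace, and smoke-test script are assumptions):

```yaml
name: rollback-drill
on:
  schedule:
    - cron: "0 4 * * 1"  # weekly, Monday 04:00 UTC
jobs:
  drill:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy current version to staging
        run: kubectl apply -f k8s/ -n staging
      - name: Roll back to the previous version
        run: kubectl rollout undo deployment/web -n staging
      - name: Verify the rolled-back deployment is healthy
        run: |
          kubectl rollout status deployment/web -n staging --timeout=120s
          ./scripts/smoke-test.sh staging
```

Running this against staging on a schedule means a broken rollback path shows up in a failed CI run, not in an incident channel.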
The Bottom Line
Rollback strategies that work at 3 AM have three properties:
- Fast: One command or button, not a procedure
- Tested: Regularly exercised, not theoretical
- Safe: Database-compatible, artifact-retained, well-understood
Blue-green is fast but expensive. Feature flags decouple code from behavior. Rolling deployments balance cost and safety. Immutable infrastructure enforces discipline.
Pick the strategy that fits your risk tolerance and operational complexity. Then test it when you’re awake, so it works when you’re not.
The goal isn’t perfection. The goal is getting back to a working state faster than the incident spirals. Rollback is a parachute—you hope you never need it, but you pack it carefully just in case.
Dealing with frequent production incidents or unclear rollback procedures? We help teams design deployment pipelines that actually work under pressure—blue-green deployments, automated rollback strategies, and database migration patterns that don’t break in production. Learn how we transformed one team’s deployment process to achieve 95% fewer deployment failures, or explore our cloud platform engineering services to build resilient infrastructure.