Database migrations are where deployment confidence goes to die. That ALTER TABLE that took 2 seconds in staging? It locks production for 40 minutes. That column rename you thought was straightforward? It breaks the application in ways your test suite didn’t catch.
Zero-downtime database migrations are possible, but they require changing how you think about schema changes. The techniques aren’t complicated individually; it’s the discipline of applying them consistently that’s hard.
Why Database Migrations Are Different
Application deployments are relatively forgiving. Deploy a bad version, roll it back, minimal damage. But database changes don’t roll back easily. Once you’ve dropped a column or changed a data type, reversing course requires another migration—if it’s possible at all.
More problematically, schema changes and application changes are tightly coupled. The application expects certain tables, columns, and types. Change the schema without updating the application, or vice versa, and things break.
This coupling is what makes zero-downtime migrations tricky: during any deployment, you have a period where old application code might run against a new schema, or new code against the old schema. Both directions need to work.
The Fundamental Pattern: Expand and Contract
The core technique for zero-downtime schema changes is expand and contract:
Expand: Add new schema elements without removing old ones. The database supports both old and new application code.
Migrate application: Deploy new application code that uses the new schema elements.
Contract: Remove the old schema elements once no application code depends on them.
This sequence ensures there’s never a moment when the schema and application are incompatible. The database expands to support both before the application moves, then contracts after the application has moved.
Renaming a Column (The Wrong Way)
Suppose you want to rename user_name to username. The obvious approach:
ALTER TABLE users RENAME COLUMN user_name TO username;
This is fast but immediately breaks any running application code that references user_name. During deployment, requests might hit old code expecting user_name or new code expecting username. They can't both work against the same schema at the same time.
Renaming a Column (The Right Way)
Step 1: Expand—add the new column
ALTER TABLE users ADD COLUMN username VARCHAR(255);
UPDATE users SET username = user_name WHERE username IS NULL;
Now both columns exist. Old code uses user_name, new code can use username.
Step 2: Deploy application changes that write to both columns
Your application writes to both user_name and username, but reads from only one of them (typically the old column, until the new one is fully backfilled). This ensures both columns stay in sync during the transition.
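If touching every write path in the application is impractical, a database trigger can handle the dual writes instead. A minimal sketch, assuming Postgres 11+ (the function and trigger names are placeholders):
CREATE OR REPLACE FUNCTION sync_username() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    -- Copy whichever column the inserting code populated
    NEW.username  := COALESCE(NEW.username, NEW.user_name);
    NEW.user_name := COALESCE(NEW.user_name, NEW.username);
  ELSIF NEW.user_name IS DISTINCT FROM OLD.user_name THEN
    -- On UPDATE, propagate whichever column actually changed
    NEW.username := NEW.user_name;
  ELSIF NEW.username IS DISTINCT FROM OLD.username THEN
    NEW.user_name := NEW.username;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_username
  BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION sync_username();
This is the same trigger-based idea pt-online-schema-change uses to keep its shadow table in sync (gh-ost tails the binlog instead). Remember to drop the trigger along with the old column in the contract step.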
Step 3: Backfill if needed
If you didn’t UPDATE in step 1, or if there’s ongoing activity, ensure data is synced:
UPDATE users SET username = user_name WHERE username IS DISTINCT FROM user_name;
(IS DISTINCT FROM is the NULL-safe comparison; a plain != returns NULL rather than true when either column is NULL, so it can silently skip rows.)
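On a large table, run the backfill in batches so each UPDATE holds its locks only briefly. One way to batch, assuming Postgres and a unique id column; repeat until it reports zero rows updated:
UPDATE users
SET username = user_name
WHERE id IN (
  SELECT id FROM users
  WHERE username IS DISTINCT FROM user_name
  LIMIT 10000
);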
Step 4: Deploy application changes that read from the new column only
Now the application uses username for both reads and writes. The old column is still written (for safety) but no longer read.
Step 5: Contract—remove the old column
Once you’re confident the new column is correct and no old application code is running:
ALTER TABLE users DROP COLUMN user_name;
This sequence is more work than a simple rename, but at no point is the application broken.
The Locking Problem
On large tables, many schema changes acquire locks that block reads or writes. Even “simple” changes can lock a table for extended periods:
- Adding a column with a DEFAULT value (pre-Postgres 11 and many MySQL versions)
- Adding an index on a large table
- Modifying column types
- Adding foreign key constraints
Strategies for handling locks:
Online schema change tools: Tools like pt-online-schema-change (Percona), gh-ost (GitHub), or native features like MySQL 8’s ALGORITHM=INSTANT and Postgres’s concurrent index creation can make changes with minimal locking.
Off-peak execution: If some locking is unavoidable, run migrations during low-traffic periods. This doesn’t eliminate risk but reduces impact.
Small batches: Instead of one ALTER TABLE on a 100-million-row table, consider partitioning strategies or batched operations that work incrementally.
Test on production-sized data: That migration that takes 100ms on your test data takes 45 minutes on production. Test against realistic data volumes.
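One more Postgres-specific safeguard: even an instant ALTER TABLE must wait for an exclusive lock, and while it waits, every query behind it queues too. Setting a lock timeout makes the migration fail fast instead of stalling the table; a sketch (the column is just an example):
-- Give up after 2 seconds instead of queuing behind long transactions
SET lock_timeout = '2s';
ALTER TABLE users ADD COLUMN last_seen_at TIMESTAMPTZ;
-- If this times out, retry later rather than blocking production traffic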
Adding NOT NULL Columns
Adding a NOT NULL column to an existing table is deceptively tricky:
-- This fails if the table has existing rows
ALTER TABLE users ADD COLUMN status VARCHAR(20) NOT NULL;
The expand-contract approach:
Step 1: Add the column as nullable:
ALTER TABLE users ADD COLUMN status VARCHAR(20);
Step 2: Deploy application code that writes a value to status for new rows.
Step 3: Backfill existing rows:
UPDATE users SET status = 'active' WHERE status IS NULL;
(Do this in batches for large tables.)
Step 4: Add the NOT NULL constraint:
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
Optionally add a DEFAULT for future convenience.
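One Postgres-specific wrinkle: SET NOT NULL normally scans the whole table while holding an exclusive lock. From Postgres 12 on, you can pay that scan cost without blocking writes by validating an equivalent CHECK constraint first; a sketch (the constraint name is a placeholder):
-- Added instantly; existing rows are not checked yet
ALTER TABLE users ADD CONSTRAINT users_status_not_null
  CHECK (status IS NOT NULL) NOT VALID;
-- Scans the table, but without blocking reads or writes
ALTER TABLE users VALIDATE CONSTRAINT users_status_not_null;
-- Postgres 12+ sees the validated constraint and skips its own scan
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
-- The CHECK constraint is now redundant
ALTER TABLE users DROP CONSTRAINT users_status_not_null;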
Index Creation
Creating indexes on large tables can lock the table for extended periods. Most databases offer concurrent index creation:
PostgreSQL:
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
MySQL 8:
CREATE INDEX idx_users_email ON users(email) ALGORITHM=INPLACE LOCK=NONE;
Concurrent index creation takes longer but doesn’t block writes. Use it for production migrations.
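Two caveats with CREATE INDEX CONCURRENTLY: it can't run inside a transaction block, and if it fails partway it leaves an INVALID index behind that must be dropped before retrying. A sketch for spotting and cleaning one up, assuming Postgres:
-- List any indexes left invalid by a failed concurrent build
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop the leftover and retry the build
DROP INDEX CONCURRENTLY idx_users_email;
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);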
Foreign Key Constraints
Adding foreign keys can be problematic:
- The constraint check scans existing data
- On large tables, this takes time and may lock
- If data violates the constraint, the ALTER fails
Safer approach:
Step 1: Add the constraint as NOT VALID (Postgres) or with constraint validation disabled:
ALTER TABLE orders ADD CONSTRAINT fk_orders_users FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;
Step 2: Validate the constraint in a separate step:
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_users;
The validation can run without blocking writes and can be retried if it fails.
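Before validating, it's worth checking for rows that would make validation fail; a sketch of the orphan check for this example:
-- Orders that reference a nonexistent user would fail validation
SELECT o.id, o.user_id
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE o.user_id IS NOT NULL
  AND u.id IS NULL;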
Handling Failed Migrations
Migrations fail. Data doesn’t match assumptions. Constraints are violated. Disk space runs out mid-migration.
Plan for failure:
- Test migrations against production data copies before running on production
- Have a rollback plan (which may require its own expand-contract sequence)
- Monitor migration progress and be ready to abort
- Don’t combine multiple risky changes in one migration
Partial failures are the worst. A migration that completes or fails cleanly is manageable. A migration that gets halfway through and leaves the schema in an inconsistent state is a crisis. Break large migrations into steps that can succeed or fail independently.
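Transactional DDL helps with the clean-fail requirement where your database supports it. Postgres runs most DDL inside transactions (CREATE INDEX CONCURRENTLY is a notable exception); MySQL commits implicitly around DDL, so this pattern doesn't apply there. A sketch, with hypothetical columns:
BEGIN;
ALTER TABLE users ADD COLUMN status VARCHAR(20);
ALTER TABLE users ADD COLUMN status_changed_at TIMESTAMPTZ;
COMMIT;
-- If anything fails before COMMIT, neither column exists afterward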
Application Compatibility Windows
During expand-contract, you have a period where both old and new schema elements exist, and both old and new application code might run. This compatibility window needs to be long enough to:
- Deploy new application code across all instances
- Drain any in-flight requests on old instances
- Verify the new code is working correctly
- Optionally roll back if problems appear
How long? Depends on your deployment process. Blue-green deployments might switch traffic in minutes. Rolling deployments across hundreds of instances might take an hour. Plan your schema migration timing accordingly.
ORMs and Migration Tools
Most frameworks include migration tools (Rails migrations, Django migrations, Flyway, Liquibase). These tools help track which migrations have run but don’t automatically make those migrations zero-downtime.
ORM migrations are convenience, not safety. A Rails migration that calls rename_column generates the obvious (breaking) ALTER TABLE. You still need to manually implement expand-contract patterns.
Some tools offer zero-downtime-aware features:
- strong_migrations gem for Rails warns about unsafe migrations
- Flyway callbacks can run pre/post-migration scripts
- Custom migration wrappers can enforce safety checks
But fundamentally, zero-downtime discipline is a process issue. Tools can help remind you; they can’t think through the implications for you.
Blue-Green Database Deployments
Blue-green deployment for stateless applications is straightforward: run two environments, switch traffic. For databases, it’s harder because both environments share state.
Options:
Shared database: Both blue and green environments use the same database. Schema must be compatible with both application versions (which is the expand-contract model).
Database replication: Blue uses the primary database; green uses a replica. Promote the replica when switching. This requires careful handling of replication lag and only works for read-heavy workloads.
Database-per-deployment: Each environment has its own database. Switching traffic also switches database. This requires migrating state between databases, which is its own can of worms.
For most teams, the shared database model with expand-contract migrations is the practical path.
Testing Migrations
Don’t test migrations only in development:
- Test on production-sized data. Performance characteristics change dramatically with scale.
- Test both directions. Can you roll back? What happens to data created between expand and contract?
- Test application compatibility. Run old application code against new schema and vice versa.
- Test failure scenarios. What happens if the migration is interrupted?
We’ve seen migrations that worked perfectly in staging take production down because no one realized the staging database had 1% of production’s data.
A Checklist
Before running a production migration:
- Does this migration lock the table? For how long?
- Have I tested against production-scale data?
- Is the schema compatible with the current running application code?
- If this fails halfway through, what state is the database in?
- Do I have a rollback plan?
- Have I communicated the migration timing to stakeholders?
- Is this a low-traffic period if locking is unavoidable?
Zero-downtime migrations are possible for almost any schema change. They’re just more work than the obvious approach. That work is worth it when the alternative is explaining to customers why the site was down for an hour at noon on a Tuesday.