Database migrations are where deployment confidence goes to die. That ALTER TABLE that took 2 seconds in staging? It locks production for 40 minutes. That column rename you thought was straightforward? It breaks the application in ways your test suite didn’t catch.
Zero-downtime database migrations are possible, but they require changing how you think about schema changes. The techniques aren’t complicated individually; it’s the discipline of applying them consistently that’s hard.
Why Database Migrations Are Different
Application deployments are relatively forgiving. Deploy a bad version, roll it back, minimal damage. But database changes don’t roll back easily. Once you’ve dropped a column or changed a data type, reversing course requires another migration—if it’s possible at all.
More problematically, schema changes and application changes are tightly coupled. The application expects certain tables, columns, and types. Change the schema without updating the application, or vice versa, and things break.
This coupling is what makes zero-downtime migrations tricky: during any deployment, you have a period where old application code might run against a new schema, or new code against the old schema. Both directions need to work.
The Fundamental Pattern: Expand and Contract
The core technique for zero-downtime schema changes is expand and contract:
Expand: Add new schema elements without removing old ones. The database supports both old and new application code.
Migrate application: Deploy new application code that uses the new schema elements.
Contract: Remove the old schema elements once no application code depends on them.
This sequence ensures there’s never a moment when the schema and application are incompatible. The database expands to support both before the application moves, then contracts after the application has moved.
Renaming a Column (The Wrong Way)
Suppose you want to rename user_name to username. The obvious approach:
ALTER TABLE users RENAME COLUMN user_name TO username;
This is fast but immediately breaks any running application code that references user_name. During deployment, requests might hit old code expecting user_name or new code expecting username. They can't both work against the same schema at the same time.
Renaming a Column (The Right Way)
Step 1: Expand—add the new column
ALTER TABLE users ADD COLUMN username VARCHAR(255);
UPDATE users SET username = user_name WHERE username IS NULL;
Now both columns exist. Old code uses user_name, new code can use username.
Step 2: Deploy application changes that write to both columns
Your application writes to both user_name and username, but reads from only one of them (typically the old column, until the new one is fully backfilled). This ensures both columns stay in sync during the transition.
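If touching every write path in the application is impractical, a database trigger can handle the dual writes instead. A minimal sketch, assuming Postgres 11+ (the function and trigger names are placeholders):
CREATE OR REPLACE FUNCTION sync_username() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    -- Copy whichever column the inserting code populated
    NEW.username  := COALESCE(NEW.username, NEW.user_name);
    NEW.user_name := COALESCE(NEW.user_name, NEW.username);
  ELSIF NEW.user_name IS DISTINCT FROM OLD.user_name THEN
    -- On UPDATE, propagate whichever column actually changed
    NEW.username := NEW.user_name;
  ELSIF NEW.username IS DISTINCT FROM OLD.username THEN
    NEW.user_name := NEW.username;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_username
  BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION sync_username();
This is the same trigger-based idea pt-online-schema-change uses to keep its shadow table in sync (gh-ost tails the binlog instead). Remember to drop the trigger along with the old column in the contract step.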
Step 3: Backfill if needed
If you didn’t UPDATE in step 1, or if there’s ongoing activity, ensure data is synced:
UPDATE users SET username = user_name WHERE username IS DISTINCT FROM user_name;
(IS DISTINCT FROM is the NULL-safe comparison; a plain != returns NULL rather than true when either column is NULL, so it can silently skip rows.)
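On a large table, run the backfill in batches so each UPDATE holds its locks only briefly. One way to batch, assuming Postgres and a unique id column; repeat until it reports zero rows updated:
UPDATE users
SET username = user_name
WHERE id IN (
  SELECT id FROM users
  WHERE username IS DISTINCT FROM user_name
  LIMIT 10000
);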
Step 4: Deploy application changes that read from the new column only
Now the application uses username for both reads and writes. The old column is still written (for safety) but no longer read.
Step 5: Contract—remove the old column
Once you’re confident the new column is correct and no old application code is running:
ALTER TABLE users DROP COLUMN user_name;
This sequence is more work than a simple rename, but at no point is the application broken.
The Locking Problem
On large tables, many schema changes acquire locks that block reads or writes. Even “simple” changes can lock a table for extended periods:
- Adding a column with a DEFAULT value (pre-Postgres 11 and many MySQL versions)
- Adding an index on a large table
- Modifying column types
- Adding foreign key constraints
Strategies for handling locks:
Online schema change tools: Tools like pt-online-schema-change (Percona), gh-ost (GitHub), or native features like MySQL 8’s ALGORITHM=INSTANT and Postgres’s concurrent index creation can make changes with minimal locking.
Off-peak execution: If some locking is unavoidable, run migrations during low-traffic periods. This doesn’t eliminate risk but reduces impact.
Small batches: Instead of one ALTER TABLE on a 100-million-row table, consider partitioning strategies or batched operations that work incrementally.
Test on production-sized data: That migration that takes 100ms on your test data takes 45 minutes on production. Test against realistic data volumes.
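One more Postgres-specific safeguard: even an instant ALTER TABLE must wait for an exclusive lock, and while it waits, every query behind it queues too. Setting a lock timeout makes the migration fail fast instead of stalling the table; a sketch (the column is just an example):
-- Give up after 2 seconds instead of queuing behind long transactions
SET lock_timeout = '2s';
ALTER TABLE users ADD COLUMN last_seen_at TIMESTAMPTZ;
-- If this times out, retry later rather than blocking production traffic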
Adding NOT NULL Columns
Adding a NOT NULL column to an existing table is deceptively tricky:
-- This fails if the table has existing rows
ALTER TABLE users ADD COLUMN status VARCHAR(20) NOT NULL;
The expand-contract approach:
Step 1: Add the column as nullable:
ALTER TABLE users ADD COLUMN status VARCHAR(20);
Step 2: Deploy application code that writes a value to status for new rows.
Step 3: Backfill existing rows:
UPDATE users SET status = 'active' WHERE status IS NULL;
(Do this in batches for large tables.)
Step 4: Add the NOT NULL constraint:
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
Optionally add a DEFAULT for future convenience.
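One Postgres-specific wrinkle: SET NOT NULL normally scans the whole table while holding an exclusive lock. From Postgres 12 on, you can pay that scan cost without blocking writes by validating an equivalent CHECK constraint first; a sketch (the constraint name is a placeholder):
-- Added instantly; existing rows are not checked yet
ALTER TABLE users ADD CONSTRAINT users_status_not_null
  CHECK (status IS NOT NULL) NOT VALID;
-- Scans the table, but without blocking reads or writes
ALTER TABLE users VALIDATE CONSTRAINT users_status_not_null;
-- Postgres 12+ sees the validated constraint and skips its own scan
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
-- The CHECK constraint is now redundant
ALTER TABLE users DROP CONSTRAINT users_status_not_null;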
Index Creation
Creating indexes on large tables can lock the table for extended periods. Most databases offer concurrent index creation:
PostgreSQL:
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
MySQL 8:
CREATE INDEX idx_users_email ON users(email) ALGORITHM=INPLACE LOCK=NONE;
Concurrent index creation takes longer but doesn’t block writes. Use it for production migrations.
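Two caveats with CREATE INDEX CONCURRENTLY: it can't run inside a transaction block, and if it fails partway it leaves an INVALID index behind that must be dropped before retrying. A sketch for spotting and cleaning one up, assuming Postgres:
-- List any indexes left invalid by a failed concurrent build
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop the leftover and retry the build
DROP INDEX CONCURRENTLY idx_users_email;
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);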
Foreign Key Constraints
Adding foreign keys can be problematic:
- The constraint check scans existing data
- On large tables, this takes time and may lock
- If data violates the constraint, the ALTER fails
Safer approach:
Step 1: Add the constraint as NOT VALID (Postgres) or with constraint validation disabled:
ALTER TABLE orders ADD CONSTRAINT fk_orders_users FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;
Step 2: Validate the constraint in a separate step:
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_users;
The validation can run without blocking writes and can be retried if it fails.
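Before validating, it's worth checking for rows that would make validation fail; a sketch of the orphan check for this example:
-- Orders that reference a nonexistent user would fail validation
SELECT o.id, o.user_id
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE o.user_id IS NOT NULL
  AND u.id IS NULL;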
Handling Failed Migrations
Migrations fail. Data doesn’t match assumptions. Constraints are violated. Disk space runs out mid-migration.
Plan for failure:
- Test migrations against production data copies before running on production
- Have a rollback plan (which may require its own expand-contract sequence)
- Monitor migration progress and be ready to abort
- Don’t combine multiple risky changes in one migration
Partial failures are the worst. A migration that completes or fails cleanly is manageable. A migration that gets halfway through and leaves the schema in an inconsistent state is a crisis. Break large migrations into steps that can succeed or fail independently.
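Transactional DDL helps with the clean-fail requirement where your database supports it. Postgres runs most DDL inside transactions (CREATE INDEX CONCURRENTLY is a notable exception); MySQL commits implicitly around DDL, so this pattern doesn't apply there. A sketch, with hypothetical columns:
BEGIN;
ALTER TABLE users ADD COLUMN status VARCHAR(20);
ALTER TABLE users ADD COLUMN status_changed_at TIMESTAMPTZ;
COMMIT;
-- If anything fails before COMMIT, neither column exists afterward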
Application Compatibility Windows
During expand-contract, you have a period where both old and new schema elements exist, and both old and new application code might run. This compatibility window needs to be long enough to:
- Deploy new application code across all instances
- Drain any in-flight requests on old instances
- Verify the new code is working correctly
- Optionally roll back if problems appear
How long? Depends on your deployment process. Blue-green deployments might switch traffic in minutes. Rolling deployments across hundreds of instances might take an hour. Plan your schema migration timing accordingly.
ORMs and Migration Tools
Most frameworks include migration tools (Rails migrations, Django migrations, Flyway, Liquibase). These tools help track which migrations have run but don’t automatically make those migrations zero-downtime.
ORM migrations are convenience, not safety. A Rails migration that calls rename_column generates the obvious (breaking) ALTER TABLE. You still need to manually implement expand-contract patterns.
Some tools offer zero-downtime-aware features:
- strong_migrations gem for Rails warns about unsafe migrations
- Flyway callbacks can run pre/post-migration scripts
- Custom migration wrappers can enforce safety checks
But fundamentally, zero-downtime discipline is a process issue. Tools can help remind you; they can’t think through the implications for you.
Blue-Green Database Deployments
Blue-green deployment for stateless applications is straightforward: run two environments, switch traffic. For databases, it’s harder because both environments share state.
Options:
Shared database: Both blue and green environments use the same database. Schema must be compatible with both application versions (which is the expand-contract model).
Database replication: Blue uses the primary database; green uses a replica. Promote the replica when switching. This requires careful handling of replication lag and only works for read-heavy workloads.
Database-per-deployment: Each environment has its own database. Switching traffic also switches database. This requires migrating state between databases, which is its own can of worms.
For most teams, the shared database model with expand-contract migrations is the practical path.
Testing Migrations
Don’t test migrations only in development:
- Test on production-sized data. Performance characteristics change dramatically with scale.
- Test both directions. Can you roll back? What happens to data created between expand and contract?
- Test application compatibility. Run old application code against new schema and vice versa.
- Test failure scenarios. What happens if the migration is interrupted?
We’ve seen migrations that worked perfectly in staging take production down because no one realized the staging database had 1% of production’s data.
A Checklist
Before running a production migration:
- Does this migration lock the table? For how long?
- Have I tested against production-scale data?
- Is the schema compatible with the current running application code?
- If this fails halfway through, what state is the database in?
- Do I have a rollback plan?
- Have I communicated the migration timing to stakeholders?
- Is this a low-traffic period if locking is unavoidable?
Zero-downtime migrations are possible for almost any schema change. They’re just more work than the obvious approach. That work is worth it when the alternative is explaining to customers why the site was down for an hour at noon on a Tuesday.