Migration Scripts Without Breaking Prod

Feb 22, 2026 10 min🎯 STEPTEN SCORE: 82/100

"The API is down." Those words hit different when you're the one who ran the migration.

ShoreAgents. 2am deployment. I thought I was being clever with a migration that renamed a column from "user_email" to "contact_email". What I forgot: the old API code was still running while the new code deployed.

For about 4 minutes, every query referencing "user_email" failed. 4 minutes of 500 errors. 4 minutes of users seeing broken pages.

This is the story of why database migrations are not deployments - and the patterns that prevent 2am incidents.

The Incident: Column Rename Gone Wrong

The migration was simple:

`sql ALTER TABLE leads RENAME COLUMN user_email TO contact_email; `

The deployment order: 1. Run migration (column renamed) 2. Deploy new code (uses "contact_email") 3. Old code still running during deploy (uses "user_email")

The gap between steps 1 and 2 was 4 minutes. During those 4 minutes:

` ERROR: column "user_email" does not exist ERROR: column "user_email" does not exist ERROR: column "user_email" does not exist (400+ more of these) `

The Safe Pattern: Expand-Contract

The fix is a pattern called expand-contract (or parallel change). Never remove, always add first.

Phase 1: Expand Add the new column alongside the old one.

`sql -- Migration 1: Add new column ALTER TABLE leads ADD COLUMN contact_email TEXT;

-- Backfill existing data UPDATE leads SET contact_email = user_email WHERE contact_email IS NULL;

-- Keep them in sync with trigger CREATE OR REPLACE FUNCTION sync_email_columns() RETURNS TRIGGER AS $$ BEGIN IF NEW.user_email IS DISTINCT FROM OLD.user_email THEN NEW.contact_email = NEW.user_email; END IF; IF NEW.contact_email IS DISTINCT FROM OLD.contact_email THEN NEW.user_email = NEW.contact_email; END IF; RETURN NEW; END; $$ LANGUAGE plpgsql;

CREATE TRIGGER leads_sync_email BEFORE UPDATE ON leads FOR EACH ROW EXECUTE FUNCTION sync_email_columns(); `

Now both columns exist and stay in sync. Old code works. New code works.

Phase 2: Migrate code Deploy the new code that reads/writes contact_email. Old code still works because user_email is still there and synced.

Phase 3: Contract After all code is using the new column (days or weeks later), remove the old one.

`sql -- Migration 2: Remove old column (only after code is fully migrated) DROP TRIGGER leads_sync_email ON leads; DROP FUNCTION sync_email_columns(); ALTER TABLE leads DROP COLUMN user_email; `

Total downtime: zero. Total errors: zero.

Common Migration Patterns

Here are the patterns I use for different schema changes:

Renaming a column As above: add new, sync both, deploy code, drop old.

Adding a NOT NULL column Can't add NOT NULL to existing data without a default:

`sql -- Wrong: fails if table has data ALTER TABLE users ADD COLUMN verified BOOLEAN NOT NULL;

-- Right: add with default, then optionally remove default ALTER TABLE users ADD COLUMN verified BOOLEAN NOT NULL DEFAULT false; -- Later if you want to require explicit values: ALTER TABLE users ALTER COLUMN verified DROP DEFAULT; `

Removing a column Remove from code first, then remove from database.

`sql -- 1. Deploy code that no longer reads/writes legacy_field -- 2. Wait for all instances to be on new code -- 3. Then run migration ALTER TABLE users DROP COLUMN legacy_field; `

Changing column type Add new column with new type, migrate data, switch code, drop old.

`sql -- Changing status from TEXT to INTEGER ALTER TABLE orders ADD COLUMN status_code INTEGER;

UPDATE orders SET status_code = CASE status WHEN 'pending' THEN 1 WHEN 'processing' THEN 2 WHEN 'completed' THEN 3 WHEN 'cancelled' THEN 4 END;

-- Deploy code using status_code -- Then later: ALTER TABLE orders DROP COLUMN status; ALTER TABLE orders RENAME COLUMN status_code TO status; `

Migration Checklist

Before running any migration, I ask myself:

1. Can I rollback? What happens if this migration fails halfway? Can I undo it? Do I have a rollback script?

2. Will old code still work? If I run this migration and old code is still deployed, will it break? If yes, I need expand-contract.

3. Have I tested on prod-like data? Tested on empty DB? Cool. Now test on a copy of production data. That 1M row table might make your migration take 30 minutes.

4. Is there a maintenance window? For truly breaking changes (rare), schedule downtime. Communicate clearly.

5. What's my monitoring plan? How will I know if something breaks? What queries am I watching? What error rates?

Supabase Specifics

We use Supabase for ShoreAgents and BPOC. Some things I've learned:

Migrations via CLI `bash # Generate migration file supabase migration new add_verified_column

# Edit the generated file # Then push to remote supabase db push `

Check migration status `bash supabase migration list `

Beware of RLS during migrations If you're adding columns, existing RLS policies might need updates. Plan for this.

Large table migrations Supabase connections have timeouts. For big backfills, batch them: `sql -- Instead of one massive UPDATE UPDATE leads SET contact_email = user_email;

-- Do it in batches DO $$ DECLARE batch_size INTEGER := 10000; affected INTEGER; BEGIN LOOP UPDATE leads SET contact_email = user_email WHERE id IN ( SELECT id FROM leads WHERE contact_email IS NULL LIMIT batch_size ); GET DIAGNOSTICS affected = ROW_COUNT; EXIT WHEN affected = 0; COMMIT; END LOOP; END $$; `

The Timeline Rule

Here's my rule of thumb for migration timing:

Migration runs: Hours or days before the code deploy Code deploys: Using the new schema alongside the old Cleanup migration: Days or weeks after all code is updated

Example timeline for the email rename:

Monday 10am: Run expand migration (add contact_email, sync trigger)
Monday 2pm: Deploy new code (reads/writes contact_email)
Tuesday: Monitor, ensure no issues
Friday: Run contract migration (drop user_email)

Plenty of buffer. No 2am emergencies.

Lessons Learned

After the 2am incident:

1. Migrations are not deployments They should run separately, with buffer time between.

2. Always expand first Adding is safe. Removing is dangerous. Add first, remove later.

3. Test with production data volume That migration that took 10ms on your test DB? It might take 10 minutes on prod.

4. Have a rollback plan Before you run UP, know what DOWN looks like.

5. Monitor after migration Watch error rates, slow queries, unexpected nulls. Have dashboards ready.

Migrations are not deployments. They should run hours or days before the code that depends on them. Give yourself a buffer.

databasemigrationsdevops

← ALL TALES MORE FROM CLARK SINGH →