Database Migrations: Ship Schema Changes Without Breaking Production
Database Migration Strategy for Your New SaaS
Goal: Build a database migration discipline that lets you ship schema changes fearlessly — without locking tables for hours, breaking running queries, or rolling back data corruption at 2am. Avoid the failure modes where founders dread schema changes (so they don't ship them) or ship them carelessly (so they break production).
Process: Follow this chat pattern with your AI coding tool, such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.
Timeframe: Migration tooling decision in 1 day. First production-safe migration runbook in week 1. Zero-downtime migration patterns embedded in PR review by week 3. Quarterly schema-review cadence baked in.
Why Most Founder Migration Practices Fail
Three failure modes hit founders the same way:
- The "ALTER TABLE in production at 2am" pattern. Founder ships schema change via a shell session into prod. The ALTER TABLE acquires a lock, blocks every read query, and the application 502s for 4 minutes. Customers report; support inbox explodes. Or worse: the migration partially succeeds and leaves the database in inconsistent state. Recovery takes hours.
- No rollback plan. Migration applied; bug discovered; need to revert. The "revert" migration was never written, never tested, and dropping a column means losing data. Founders learn from this once; the next migration has a rollback plan, but the team hasn't internalized the pattern.
- Tightly-coupled code-and-schema deploys. Schema migration and application code change are deployed together. If anything goes wrong, the application can't start (new code expects new schema; old schema doesn't match). Long deploy windows; manual rollback complexity. The right pattern is decoupled: migrate first, deploy code that handles BOTH old and new schemas, eventually clean up the old schema in a follow-up.
The version that works is structured: pick a migration tool, write reversible migrations by default, follow the expand-contract pattern for production-safe changes, test rollbacks, run quarterly schema audits, and treat migrations as application changes (not ops emergencies).
This guide assumes you have already chosen a Database Provider (Postgres-flavored is assumed; the patterns apply across providers), have shipped Backups & Disaster Recovery (the safety net), and have wired CI/CD (migrations run automatically).
1. Pick a Migration Tool
The first decision. Pick deliberately — switching later is real work.
You're helping me pick the database migration tool for [your stack — Next.js / SvelteKit / your framework] using [Postgres via Supabase / Neon / RDS / your provider].
The options for Postgres-flavored databases:
**Drizzle ORM + Drizzle Kit** — modern TypeScript-first
- Schema as TypeScript code; migrations generated from schema diffs
- Lightweight; SQL-first feel; type-safe
- Strong DX; growing community
- Default for new TypeScript projects in 2026
**Prisma + Prisma Migrate** — incumbent ORM
- Schema in Prisma's DSL; migrations auto-generated
- Mature; broad ecosystem; Prisma Studio for inspection
- Slightly heavier than Drizzle
- Strong default for TypeScript / JavaScript projects
**Knex / Objection** — query-builder + raw migrations
- More manual; you write SQL migrations directly
- Mature
- Pick when you don't want full ORM abstraction
**Atlas / Squawk** — declarative schema management
- Atlas: declarative state; tool generates migrations
- Squawk: lints SQL migrations for safety patterns
- Pick alongside any of the above for production-safety
**Supabase Migrations** — built into Supabase CLI
- Supabase project management; migrations versioned in repo
- Use if Supabase is your DB layer
- Lightweight; CLI-based
**Convex** — schema in TypeScript; migrations handled by Convex
- If you're on Convex, schema changes are different from traditional Postgres migrations
- Convex pushes schema with code deploys
- Pick when Convex is your stack
**Raw SQL + custom runner** — DIY
- Write `.sql` files; custom runner applies them
- More work; full control
- Common in older codebases; rare in 2026 indie SaaS
For most indie SaaS in 2026 on Postgres:
- **TypeScript / Next.js / SvelteKit**: Drizzle or Prisma
- **Supabase customer**: Drizzle/Prisma + Supabase CLI for syncing
- **Polyglot or Rails / Django / Laravel**: framework-native migration tools
Output:
1. The recommended tool for my stack
2. The reasoning (Drizzle vs Prisma if applicable)
3. The setup steps
4. The folder structure for migrations
5. The CI integration: how migrations apply on every deploy
Three principles I've watched founders re-learn:
- Pick once; commit to it. Migration tooling is hard to swap. Choose deliberately at week 1; iterate within the chosen tool.
- Schema-as-code beats SQL-only migrations. Drizzle / Prisma let you express schema intent in TypeScript; the generated SQL handles the diff. Less drift between code and database.
- The CI integration matters most. A migration tool nobody runs in CI is a tool nobody runs. Build the deploy-time migration step at week 1.
2. Write Reversible Migrations by Default
Every migration has an UP path and a DOWN path. Most founders skip the DOWN. Don't.
Help me design the migration discipline for reversibility.
Every migration includes:
- UP: the change to apply
- DOWN: the change to reverse
For each migration type, the DOWN pattern:
**ADD COLUMN**:
- UP: ALTER TABLE x ADD COLUMN y TYPE
- DOWN: ALTER TABLE x DROP COLUMN y
- Risk: dropping a column loses data; only run DOWN before real production data has been written to the column
**DROP COLUMN**:
- UP: rename to deprecated_*; do not actually DROP
- DOWN: rename back
- Wait 30+ days before actually dropping (separate migration)
- This pattern saves you from the "I dropped the wrong column" disaster
**ADD TABLE**:
- UP: CREATE TABLE
- DOWN: DROP TABLE
- Easy reversal; no data loss if table is empty at revert time
**DROP TABLE**:
- UP: rename to dropped_*; back up the data
- DOWN: rename back
- Wait 30+ days; then DROP in a separate migration after verifying nothing references it
**RENAME COLUMN**:
- UP + DOWN: simple rename; reversible
- Watch out for application code that references the column
**ADD INDEX**:
- UP: CREATE INDEX CONCURRENTLY (Postgres specific; doesn't lock writes)
- DOWN: DROP INDEX CONCURRENTLY
- Critical: NEVER use CREATE INDEX without CONCURRENTLY in production; locks the table
**MODIFY COLUMN TYPE** (e.g., INT → BIGINT when IDs run out of 32-bit space):
- This is where most production outages start
- See "expand-contract pattern" below; do NOT do this as a single ALTER
**ADD CHECK CONSTRAINT or FOREIGN KEY**:
- UP: ALTER TABLE ADD CONSTRAINT name CHECK / FOREIGN KEY ... NOT VALID
- NOT VALID skips the validation check at constraint creation (fast)
- Then: ALTER TABLE VALIDATE CONSTRAINT name (slow but doesn't lock writes)
- DOWN: ALTER TABLE DROP CONSTRAINT
**ADD UNIQUE**:
- UP: CREATE UNIQUE INDEX CONCURRENTLY first; then ADD CONSTRAINT USING INDEX
- DOWN: drop the constraint and the index
Output:
1. The migration template with UP/DOWN sections
2. The lint rules (Squawk-style) that catch dangerous patterns at PR review:
- No CREATE INDEX without CONCURRENTLY in production migrations
- No ALTER TABLE ... ADD COLUMN ... NOT NULL without DEFAULT (locks table)
- No DROP TABLE / DROP COLUMN in same migration as code changes
3. The PR-review checklist for any migration
4. The deploy-time check: every applied migration's reversibility verified pre-deploy
Three rules that prevent disasters:
- DROP is a 30-day-deferred operation. Renaming to `deprecated_*` first, dropping later, prevents the "I dropped the wrong thing" disaster.
- CREATE INDEX always runs with CONCURRENTLY in Postgres. The lock from non-concurrent index creation has caused more production outages than any other migration mistake.
- NOT NULL columns get DEFAULT values, always. Without one, ADD COLUMN ... NOT NULL fails outright on a non-empty table; and on Postgres 10 and older, even the DEFAULT version rewrites the whole table under a lock.
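A minimal sketch of what these rules look like as a paired UP/DOWN migration, using a hypothetical `orders` table (the filename, table, column, and constraint names are illustrative, not your tool's convention):

```sql
-- 0042_add_order_status.up.sql (hypothetical name)
-- NOT NULL always ships with a DEFAULT; on Postgres 11+ this avoids a table rewrite.
ALTER TABLE orders ADD COLUMN status TEXT NOT NULL DEFAULT 'pending';

-- NOT VALID adds the constraint without scanning existing rows (fast)...
ALTER TABLE orders
  ADD CONSTRAINT orders_status_known
  CHECK (status IN ('pending', 'paid', 'refunded')) NOT VALID;
-- ...then VALIDATE scans the table separately, without blocking writes.
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_known;

-- CONCURRENTLY avoids the write lock; note it cannot run inside a transaction.
CREATE INDEX CONCURRENTLY orders_status_idx ON orders (status);

-- 0042_add_order_status.down.sql
DROP INDEX CONCURRENTLY orders_status_idx;
ALTER TABLE orders DROP CONSTRAINT orders_status_known;
-- Dropping the column loses data; in a real revert after launch you would
-- rename it to deprecated_status instead and drop it 30+ days later.
ALTER TABLE orders DROP COLUMN status;
```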
3. Use the Expand-Contract Pattern for Production-Safe Changes
The single most important migration pattern. Most outages come from skipping it.
For changes that aren't trivially reversible, use the expand-contract pattern.
The pattern:
**Phase 1: EXPAND** — make the schema accept both old and new shapes
- Add the new column / table / shape WITHOUT touching old
- Migration #1: pure additive change
- Application code: still using old shape; new column is unused
- Result: zero risk; no rollback drama
**Phase 2: BACKFILL** — populate the new shape from the old
- Background job copies data from old → new
- Application code: writes to BOTH old and new; reads from old
- Result: data parity achieved; reads still consistent
**Phase 3: SWITCH** — application reads from new shape
- Application code: writes to BOTH; reads from NEW
- Verify nothing has broken in production
- Wait 7-30 days
**Phase 4: CONTRACT** — drop the old shape
- Migration #2: drop the old column / table
- Application code: writes only to NEW; reads only from NEW
- Wait 30+ days from phase 3 before phase 4 to ensure no surprise dependencies
Worked example: changing column type INT → BIGINT (e.g., user IDs running out of 32-bit space):
- Phase 1: ADD COLUMN id_new BIGINT (no constraint yet)
- Phase 2: backfill id_new = id for all existing rows; new INSERTs write both id and id_new
- Phase 3: switch reads to id_new; verify
- Phase 4: drop the id column; rename id_new to id
Total time: 30-90 days for a serious change. Each phase is a separate deploy.
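A sketch of the four phases as SQL, shown for a hypothetical non-key `events.user_id` column to keep it readable (a primary-key swap additionally has to move the PK constraint, the sequence, and any referencing foreign keys):

```sql
-- Phase 1: EXPAND (deploy 1). Purely additive; zero risk.
ALTER TABLE events ADD COLUMN user_id_new BIGINT;

-- Phase 2: BACKFILL. A background job runs batched updates; never one giant UPDATE.
-- Application code writes BOTH columns on every INSERT/UPDATE during this phase.
UPDATE events
SET user_id_new = user_id
WHERE id >= 0 AND id < 10000   -- slide this window batch by batch
  AND user_id_new IS NULL;

-- Phase 3: SWITCH. No migration; a code deploy flips reads to user_id_new.

-- Phase 4: CONTRACT (deploy 2, 30+ days later).
ALTER TABLE events DROP COLUMN user_id;
ALTER TABLE events RENAME COLUMN user_id_new TO user_id;
```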
When to skip the pattern:
- Pre-launch: just blow away the schema and rebuild
- Tiny tables (<1000 rows): the lock is brief; direct migration is fine
- Internal-only tables: customer impact is zero
When to ALWAYS use the pattern:
- Multi-million-row tables
- Customer-facing tables
- Any production database that runs hot
Output:
1. The 4-phase template for any non-trivial migration
2. The decision tree: "is this safe to do in one phase, or do I need expand-contract?"
3. The runbook for the BACKFILL phase (background job, batched, monitored)
4. The verification checklist for the SWITCH phase
5. The waiting periods between phases (7-30 days minimum)
The single most consequential discipline: never combine expand + contract in one deploy. The two-deploy minimum (expand on Monday, switch on Wednesday after verifying) is what makes schema changes safe in production. Founders who try to do it all in one deploy create their own outages.
4. Run Migrations in CI/CD Automatically
Manual migrations at 2am are a sign of broken process. Automate.
Help me wire migrations into [your CI/CD provider].
The pattern:
**On PR**:
- Migration files in PR auto-validated (linter checks dangerous patterns)
- CI runs the migration against a test database
- Tests run after migration applied
- If anything fails, the PR is blocked
**On merge to main**:
- CI applies the migration to staging environment first
- Run smoke tests
- If staging is green, deploy code (which assumes the migration has run)
- Production migration applied as part of deploy
**Critical**:
- Migrations run BEFORE code deploys, not after
- Code is forward-compatible: handles both pre-migration and post-migration schema
- This decoupling means you can roll back code without rolling back database
**Migration vs deploy ordering**:
1. Apply migration (additive only, per expand-contract phase 1)
2. Deploy code (handles both old and new schemas, in expand-contract phase 2-3)
3. Eventually run cleanup migration (expand-contract phase 4)
**Concurrency**:
- Only one migration can run against the database at a time
- Use a lock table or advisory lock; Drizzle / Prisma handle this for you
- Skip the migration step if another deploy holds the lock (see the sketch at the end of this section)
**Failure handling**:
- If migration fails: deploy halts; alert; manual investigation
- If code deploy fails: roll back code; migration stays (because migrations are additive only in phase 1)
- If cleanup migration fails: investigate; rollback code if necessary
Output:
1. The CI pipeline config (GitHub Actions / your CI)
2. The migration-runner script
3. The lock-acquisition logic
4. The failure-handling runbook
5. The pre-deploy + post-deploy notification (Slack / email)
The biggest mistake: migrations as deploy-time scripts inside the application server startup. This couples app start to migration completion; if migration takes 30 minutes, your app is down for 30 minutes. Always run migrations as a separate step BEFORE app deploy.
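If your migration tool doesn't already handle locking, a sketch of the advisory-lock guard in plain SQL; the key 727001 is an arbitrary constant you would pick once per application:

```sql
-- Session-level advisory lock: held until released or the session ends.
-- Returns false immediately if another deploy already holds it.
SELECT pg_try_advisory_lock(727001) AS acquired;

-- acquired = true:  apply pending migrations on this same connection
-- acquired = false: skip the migration step; another deploy is mid-flight

SELECT pg_advisory_unlock(727001);
```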
5. Test Rollbacks (Yes, Actually)
A rollback that's never been tested is a hypothesis. Test it.
Design the migration-rollback test suite.
For each new migration, test:
**Test 1: UP applies cleanly**
- On a fresh test database with the prior schema
- Apply the migration
- Verify the schema matches expected post-migration state
- Verify no errors
**Test 2: DOWN reverses cleanly**
- On a database after UP applied
- Apply the DOWN
- Verify the schema returns to the prior state (schema-assertion sketch at the end of this section)
- Note any data loss (e.g., a DROPPED column means data is lost)
**Test 3: Application code works at every phase**
- Code with the prior schema: works
- Code with the new schema: works
- Code at the in-between state (data partially migrated): works
- This is the expand-contract verification
**Test 4: Production-data simulation**
- Spin up a copy of production data (per [Backups & Disaster Recovery](backups-disaster-recovery-chat.md))
- Apply the migration; time it
- Verify performance: did the migration lock anything beyond expectations? Did it complete in reasonable time?
**Test 5: Concurrent-load simulation**
- Apply the migration while running synthetic application load
- Verify no errors during migration and no performance degradation beyond what was expected
Run these tests:
- For every migration in CI (Tests 1-3)
- Before any high-risk migration in staging (Tests 4-5)
Output:
1. The test framework setup
2. The CI integration
3. The staging-test runbook for high-risk migrations
4. The "high-risk" criteria: which migrations need extra testing
The discipline that prevents 2am incidents: Test 4 (production-data simulation) for any migration touching tables >1M rows. This catches "the migration runs in 2 seconds in dev, 4 hours in prod" disasters before they happen.
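For Tests 1 and 2, schema state can be asserted with catalog queries against the test database; a sketch, reusing the hypothetical `orders.status` column from section 2:

```sql
-- After UP: expect exactly one row describing the new column.
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'orders' AND column_name = 'status';
-- After DOWN: the same query should return zero rows.

-- Constraint presence (and whether VALIDATE has run) is visible in pg_constraint.
SELECT conname, convalidated
FROM pg_constraint
WHERE conrelid = 'orders'::regclass AND conname = 'orders_status_known';
```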
6. Schedule High-Risk Migrations Carefully
Not all migrations are equal. The risky ones get special handling.
Design the schedule for high-risk migrations.
**Migration risk classification**:
**Low risk** (run anytime):
- ADD COLUMN with DEFAULT (or NULL allowed)
- ADD TABLE
- ADD INDEX CONCURRENTLY
- Reversible, additive, no data movement
**Medium risk** (run in business hours, with team awareness):
- BACKFILL operations on large tables
- Constraint additions (ADD ... NOT VALID, then VALIDATE)
- Renaming operations
**High risk** (run during a planned maintenance window):
- DROP TABLE / DROP COLUMN (the actual drop, after 30-day deprecation)
- ALTER COLUMN TYPE (without expand-contract — usually wrong)
- Multi-table data migrations
- Anything that might lock production tables for >30 seconds
**Production maintenance window planning**:
- Pick a low-traffic time (typically Saturday early morning UTC for global products)
- Communicate via [Status Page](status-page-chat.md) 7+ days in advance
- Run the migration in staging first; time it
- Have rollback runbook ready
- Have on-call engineer monitoring during the window
**Communications template** (for high-risk migrations affecting customers):
- T-7 days: status page entry "Scheduled maintenance: [feature] on [date]"
- T-1 day: reminder email to customers
- T-30 min: status page → "in progress"
- During: status updates every 30 min
- After: "completed" with brief explanation
Output:
1. The risk-classification rubric
2. The scheduling process (when to schedule each tier)
3. The maintenance-window runbook
4. The customer-communication templates
The single most useful habit: time the migration in staging against production-sized data BEFORE scheduling. A migration that takes "2 minutes" against a 10K-row test database might take 4 hours against a 50M-row production table. Time it.
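A sketch of the timing run in a psql session against a staging copy restored from a production backup. Lock-acquiring changes can be wrapped in a transaction and rolled back: the work is real, the timing is representative, and nothing persists. CONCURRENTLY operations refuse to run inside a transaction, so time those without the wrapper. The table and column here are hypothetical:

```sql
\timing on

BEGIN;
-- A full-table-rewrite candidate: measure it before scheduling.
ALTER TABLE events ALTER COLUMN payload TYPE jsonb USING payload::jsonb;
ROLLBACK;
```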
7. Quarterly Schema Audit
Schemas drift. Tables accumulate fields nobody uses. Indexes that no longer help. Audit.
Design the quarterly schema audit.
Every 90 days, review:
**1. Unused columns**:
- Columns nobody reads or writes
- Found via: query-log analysis or pg_stat_statements (Postgres keeps no per-column usage statistics)
- Action: deprecate (rename to deprecated_*); 30 days later, drop
**2. Unused indexes**:
- Indexes never used by any query
- Found via: pg_stat_user_indexes (idx_scan = 0); a query sketch appears at the end of this section
- Action: drop (saves write cost; reduces table size)
- Caution: don't drop indexes used only in rare reports
**3. Missing indexes**:
- Slow queries that would benefit from an index
- Found via: Postgres slow-query log; pg_stat_statements
- Action: add CONCURRENTLY; verify performance
**4. Schema bloat**:
- Tables with autovacuum issues, dead row accumulation
- Found via: Postgres bloat queries
- Action: VACUUM FULL during maintenance window if severe (locks table); or pg_repack online
**5. Type mismatches**:
- Columns with INT that should be BIGINT (running out of space)
- Columns with VARCHAR(N) that should be TEXT (arbitrary cutoffs)
- Action: VARCHAR(N) → TEXT is metadata-only in Postgres and safe to run directly; INT → BIGINT goes through the expand-contract pattern
**6. Foreign key audit**:
- Are all FKs in place where they should be?
- Are there cross-tenant FKs that violate [Multi-Tenant Data Isolation](multi-tenancy-chat.md)?
- Action: add missing FKs; remove cross-tenant violations
**7. Constraint audit**:
- Are NOT NULL, CHECK, UNIQUE constraints where they should be?
- Action: ADD ... NOT VALID then VALIDATE for any missing
Output:
1. The audit-query SQL set
2. The 90-day schedule (quarterly)
3. The action-prioritization rubric (what to fix first)
4. The audit-report template (track over time)
The audit catches drift before it becomes a problem. Without it, by year 2 your schema has 30 unused indexes, 12 deprecated-but-not-dropped columns, and 5 tables nobody uses anymore — slowing every operation incrementally.
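For item 2, a sketch of the unused-index query. Before dropping anything, check `pg_constraint`: an index backing a UNIQUE or PRIMARY KEY constraint can show zero scans and still be doing its job:

```sql
-- Indexes never scanned since statistics were last reset, largest first.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```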
8. Document Migration History
A schema is the cumulative result of every migration. Make it readable.
Maintain documentation of every meaningful migration.
For each migration:
- Migration filename / number
- Date applied (in production)
- Author (PR link)
- The "why" — not the SQL, but the business reason
- The phase (if expand-contract): which of the 4 phases this represents
- Any rollback instructions specific to this migration
- Performance notes (how long it took in production)
Maintain in:
- `/db/migrations/` (the actual migration files)
- `/db/CHANGELOG.md` — summary of meaningful schema changes by date
- Linked from `/db/README.md` — high-level architecture overview
The litmus test: a new engineer joining the team in year 2 should be able to understand the schema's evolution from CHANGELOG.md.
Output:
1. The CHANGELOG.md template
2. The README.md template for the db folder
3. The PR-template requirement: every migration PR must update the CHANGELOG
The single most valuable artifact: the README + CHANGELOG explaining "why" each major schema decision was made. Code comments rot; PR comments are buried; the README + CHANGELOG is the durable record.
What Done Looks Like
By end of week 2 of building migration discipline:
- Migration tool picked + integrated (Drizzle / Prisma / your choice)
- CI/CD running migrations automatically
- Migration linter catching dangerous patterns at PR review
- Reversibility template for every new migration
- Expand-contract pattern documented and adopted for non-trivial changes
- First production migration shipped following the pattern
- CHANGELOG.md capturing migration history
Within 90 days:
- Zero "ALTER TABLE in production at 2am" incidents
- 1+ expand-contract migration successfully shipped
- 1 quarterly audit completed with action items
- Migration-related deploy time reduced (CI runs migrations; humans don't)
Within 12 months:
- Migration discipline is invisible because it works
- Schema is clean (audits caught drift early)
- New engineers can understand schema history from CHANGELOG
- Major schema changes (e.g., type changes) ship without incident
Common Pitfalls
- Manual migrations in production. Always run via CI/CD with proper safety checks.
- No DOWN path. Every migration must have a rollback plan.
- Combining schema and code in a single deploy. Decouple; expand-contract.
- CREATE INDEX without CONCURRENTLY. The single most common production-outage cause.
- DROP TABLE / DROP COLUMN immediately. Always rename-first, drop-later (30+ days).
- Skipping expand-contract for "simple" type changes. Postgres ALTER COLUMN TYPE is rarely simple at scale.
- Untested rollbacks. A rollback that's never been tested isn't a rollback.
- No staging-test for risky migrations. Time the migration against production-sized data first.
- Treating migrations as "ops" not "code." They're code; review, test, and version like code.
- Forgetting the CHANGELOG. A year in, nobody remembers why a column exists. The CHANGELOG is the answer.
Where Database Migrations Plug Into the Rest of the Stack
- Database Providers — different providers have different migration tooling
- Backups & Disaster Recovery — restore-from-backup is the ultimate rollback for catastrophic migration mistakes
- Multi-Tenant Data Isolation — schema changes must respect tenant scoping
- Audit Logs — schema changes themselves can be audit-logged
- CI/CD Providers — migrations run in the deploy pipeline
- Status Page — high-risk migrations announced via status page
- Incident Response — migration mistakes cause incidents; runbook applies
- PostHog Setup — application analytics for verifying migrations work
- Drizzle vs Prisma — the migration tool choice deep-dive
- Postgres — engine-specific migration patterns
What's Next
Database migrations are the unsexy infrastructure that makes shipping product changes possible. The team that builds discipline early — proper tooling, expand-contract pattern, CI integration, quarterly audits — ships schema changes weekly without drama. The team that defers it ships changes scared, slowly, with occasional outages.
Build the discipline now while the schema is small. The patterns scale; the discipline compounds. By year 2, you're shipping schema changes confidently while competitors are still firefighting their last migration mistake.