Zero-Downtime Database Migrations in Financial Systems

Introduction #

In financial systems, downtime means lost revenue and regulatory risk. A trading platform processing billions in transactions can't afford a maintenance window. This post shares our battle-tested strategies for zero-downtime database migrations.

The Challenge #

Traditional migration approach:

bash

1# ❌ Don't do this in production
21. Stop application
32. Run migrations
43. Start application
5

Downtime: 5-30 minutes
Revenue impact: $50k -$ 500k per hour for large platforms
Regulatory risk: Trading halts trigger regulatory scrutiny

The Zero-Downtime Pattern #

Core principle: Deploy in phases, never break backward compatibility

plaintext

1Phase 1: Expand (add new schema)
2Phase 2: Migrate (copy data)
3Phase 3: Contract (remove old schema)
4

Each phase is a separate deployment with full backward compatibility.

Example: Adding a Column #

❌ Breaking Change #

sql

1-- This breaks running code immediately
2ALTER TABLE orders ADD COLUMN execution_venue VARCHAR(50) NOT NULL;
3

✅ Zero-Downtime Approach #

Phase 1: Add nullable column

sql

1-- Migration v1
2ALTER TABLE orders ADD COLUMN execution_venue VARCHAR(50) NULL;
3
4-- Old code continues working (ignores new column)
5-- New code can start using it
6

Phase 2: Backfill data

sql

1-- Migration v2 (run during low traffic)
2UPDATE orders 
3SET execution_venue = 'NYSE' 
4WHERE execution_venue IS NULL 
5  AND symbol IN (SELECT symbol FROM nyse_listings);
6
7UPDATE orders 
8SET execution_venue = 'NASDAQ' 
9WHERE execution_venue IS NULL;
10

Phase 3: Add constraint

sql

1-- Migration v3 (after all old code deployed)
2ALTER TABLE orders 
3ALTER COLUMN execution_venue SET NOT NULL;
4
5CREATE INDEX idx_orders_venue ON orders(execution_venue);
6

Renaming Columns #

Renaming is especially tricky - old and new code must coexist.

The View Approach #

Phase 1: Create view with both names

sql

1-- Migration v1
2ALTER TABLE trades ADD COLUMN execution_timestamp TIMESTAMPTZ;
3
4-- Update new column when old column is written
5CREATE OR REPLACE FUNCTION sync_execution_time()
6RETURNS TRIGGER AS $$
7BEGIN
8  IF NEW.execution_time IS NOT NULL THEN
9    NEW.execution_timestamp = NEW.execution_time;
10  END IF;
11  RETURN NEW;
12END;
13$$ LANGUAGE plpgsql;
14
15CREATE TRIGGER sync_execution_time_trigger
16BEFORE INSERT OR UPDATE ON trades
17FOR EACH ROW
18EXECUTE FUNCTION sync_execution_time();
19
20-- Backfill existing data
21UPDATE trades 
22SET execution_timestamp = execution_time 
23WHERE execution_timestamp IS NULL;
24

Phase 2: Update application code

typescript

1// Old code (still running)
2const trade = await db.query(
3  'INSERT INTO trades (symbol, quantity, execution_time) VALUES ($1, $2, $3)',
4  ['AAPL', 100, new Date()]
5);
6
7// New code (gradual rollout)
8const trade = await db.query(
9  'INSERT INTO trades (symbol, quantity, execution_timestamp) VALUES ($1, $2, $3)',
10  ['AAPL', 100, new Date()]
11);
12

Phase 3: Remove old column

sql

1-- Migration v2 (after all old code replaced)
2DROP TRIGGER sync_execution_time_trigger ON trades;
3DROP FUNCTION sync_execution_time();
4ALTER TABLE trades DROP COLUMN execution_time;
5

Changing Column Types #

Even more complex - requires data transformation.

Example: Numeric precision change #

Current: price DECIMAL(10,2)
Needed: price DECIMAL(18,8) (for crypto trading)

Phase 1: Shadow column

sql

1-- Migration v1
2ALTER TABLE orders ADD COLUMN price_precise DECIMAL(18,8);
3
4-- Dual write trigger
5CREATE OR REPLACE FUNCTION sync_price()
6RETURNS TRIGGER AS $$
7BEGIN
8  IF NEW.price IS NOT NULL THEN
9    NEW.price_precise = NEW.price::DECIMAL(18,8);
10  END IF;
11  IF NEW.price_precise IS NOT NULL THEN
12    NEW.price = NEW.price_precise::DECIMAL(10,2);
13  END IF;
14  RETURN NEW;
15END;
16$$ LANGUAGE plpgsql;
17
18CREATE TRIGGER sync_price_trigger
19BEFORE INSERT OR UPDATE ON orders
20FOR EACH ROW
21EXECUTE FUNCTION sync_price();
22
23-- Backfill
24UPDATE orders SET price_precise = price::DECIMAL(18,8);
25

Phase 2: Switch reads

typescript

1// Old code
2const order = await db.query('SELECT id, symbol, price FROM orders');
3
4// New code
5const order = await db.query(
6  'SELECT id, symbol, price_precise as price FROM orders'
7);
8

Phase 3: Swap columns

sql

1-- Migration v2
2DROP TRIGGER sync_price_trigger ON orders;
3DROP FUNCTION sync_price();
4
5ALTER TABLE orders DROP COLUMN price;
6ALTER TABLE orders RENAME COLUMN price_precise TO price;
7
8-- Recreate indexes
9CREATE INDEX idx_orders_price ON orders(price);
10

Index Management #

Indexes can lock tables for hours on large datasets.

Problem #

sql

1-- ❌ Locks table for 2 hours on 1B row table
2CREATE INDEX idx_trades_timestamp ON trades(execution_timestamp);
3

Solution: Concurrent indexes #

sql

1-- ✅ No table lock (builds in background)
2CREATE INDEX CONCURRENTLY idx_trades_timestamp 
3ON trades(execution_timestamp);
4
5-- Monitor progress
6SELECT 
7  schemaname,
8  tablename,
9  indexname,
10  pg_size_pretty(pg_relation_size(indexrelid)) as size
11FROM pg_stat_user_indexes
12WHERE indexrelname = 'idx_trades_timestamp';
13

Caveat: CONCURRENTLY requires more disk space and time but doesn't block writes.

Large Data Migrations #

Backfilling 1 billion rows requires careful batching.

❌ Wrong Approach #

sql

1-- Locks table, fills transaction log, runs for hours
2UPDATE trades 
3SET venue_id = (SELECT id FROM venues WHERE name = venue_name);
4

✅ Batched Approach #

sql

1-- Migration script
2DO $$
3DECLARE
4  batch_size INTEGER := 10000;
5  total_rows BIGINT;
6  processed BIGINT := 0;
7BEGIN
8  SELECT COUNT(*) INTO total_rows FROM trades WHERE venue_id IS NULL;
9  
10  WHILE processed < total_rows LOOP
11    UPDATE trades
12    SET venue_id = v.id
13    FROM venues v
14    WHERE trades.venue_name = v.name
15      AND trades.venue_id IS NULL
16      AND trades.id IN (
17        SELECT id FROM trades 
18        WHERE venue_id IS NULL 
19        LIMIT batch_size
20      );
21    
22    processed := processed + batch_size;
23    
24    -- Log progress
25    RAISE NOTICE 'Processed % / % rows (% %%)', 
26      processed, total_rows, (processed * 100 / total_rows);
27    
28    -- Prevent transaction log bloat
29    COMMIT;
30    
31    -- Throttle to avoid overwhelming database
32    PERFORM pg_sleep(0.1);
33  END LOOP;
34END $$;
35

Foreign Key Changes #

Adding foreign keys can lock tables.

Safe Approach #

sql

1-- Phase 1: Add constraint NOT VALID (doesn't scan table)
2ALTER TABLE orders 
3ADD CONSTRAINT fk_orders_accounts 
4FOREIGN KEY (account_id) REFERENCES accounts(id)
5NOT VALID;
6
7-- Phase 2: Validate in background (doesn't block writes)
8ALTER TABLE orders 
9VALIDATE CONSTRAINT fk_orders_accounts;
10

Schema Versioning #

Track migrations carefully:

sql

1CREATE TABLE schema_migrations (
2  version VARCHAR(50) PRIMARY KEY,
3  description TEXT,
4  applied_at TIMESTAMPTZ DEFAULT NOW(),
5  applied_by VARCHAR(100),
6  execution_time INTERVAL,
7  rollback_script TEXT
8);
9
10-- Record each migration
11INSERT INTO schema_migrations (version, description, rollback_script)
12VALUES (
13  '2024_11_01_001',
14  'Add execution_venue column to orders',
15  'ALTER TABLE orders DROP COLUMN execution_venue;'
16);
17

Testing Migrations #

1. Test with production data volume #

sql

1-- Create test table with same size
2CREATE TABLE trades_test AS 
3SELECT * FROM trades LIMIT 10000000;
4
5-- Test migration
6\timing on
7ALTER TABLE trades_test ADD COLUMN new_column VARCHAR(50);
8-- Output: Time: 45231.234 ms (45 seconds)
9

2. Test backward compatibility #

typescript

1// Simulation test
2describe('Migration compatibility', () => {
3  it('old code works with new schema', async () => {
4    // Apply new schema
5    await runMigration('2024_11_01_001');
6    
7    // Run old code
8    const result = await oldVersionCode.createOrder({
9      symbol: 'AAPL',
10      quantity: 100
11    });
12    
13    expect(result).toBeDefined();
14  });
15  
16  it('new code works with old schema', async () => {
17    // Don't apply migration
18    
19    // Run new code (should handle missing column)
20    const result = await newVersionCode.createOrder({
21      symbol: 'AAPL',
22      quantity: 100,
23      executionVenue: 'NYSE'
24    });
25    
26    expect(result).toBeDefined();
27  });
28});
29

Rollback Strategy #

Always have a rollback plan:

sql

1-- Migration up
2-- v2024_11_01_001_up.sql
3ALTER TABLE orders ADD COLUMN execution_venue VARCHAR(50);
4
5-- Migration down
6-- v2024_11_01_001_down.sql
7ALTER TABLE orders DROP COLUMN execution_venue;
8

For data migrations, store original values:

sql

1-- Before updating
2CREATE TABLE orders_backup_20241101 AS
3SELECT id, original_column FROM orders;
4
5-- If rollback needed
6UPDATE orders o
7SET original_column = b.original_column
8FROM orders_backup_20241101 b
9WHERE o.id = b.id;
10

Monitoring During Migration #

sql

1-- Check migration progress
2SELECT 
3  schemaname,
4  tablename,
5  n_tup_upd,
6  n_tup_hot_upd,
7  n_live_tup
8FROM pg_stat_user_tables
9WHERE tablename = 'orders';
10
11-- Monitor locks
12SELECT 
13  pid,
14  usename,
15  pg_blocking_pids(pid) as blocked_by,
16  query
17FROM pg_stat_activity
18WHERE datname = 'trading_db'
19  AND state = 'active';
20
21-- Check replication lag (if using replicas)
22SELECT 
23  client_addr,
24  state,
25  sync_state,
26  replay_lag
27FROM pg_stat_replication;
28

Real-World Example: Our Trading Platform #

Challenge: Migrate 800M trades to new schema with additional columns

Approach:

plaintext

1Day 1: Add nullable columns (5 minutes)
2Day 2-14: Backfill in batches (100k rows/minute, off-peak hours)
3Day 15: Add NOT NULL constraints (10 minutes)
4Day 16: Deploy new application code
5Day 30: Remove old columns (5 minutes)
6

Result:

Zero downtime
Zero data loss
No performance degradation
Completed in 30 days vs 6-hour maintenance window

Best Practices Checklist #

✅ Never deploy breaking changes
✅ Always deploy schema changes before code changes
✅ Use batched updates for large datasets
✅ Create indexes concurrently
✅ Add constraints as NOT VALID first
✅ Test with production data volumes
✅ Monitor locks and replication lag
✅ Have rollback scripts ready
✅ Document each migration
✅ Validate backward compatibility

Conclusion #

Zero-downtime migrations are possible with careful planning:

Expand - add new schema elements
Migrate - gradually move data
Contract - remove old schema

The key is maintaining backward compatibility at every step. It requires more migrations and longer timelines, but eliminates business risk.

Need Help?#

We specialize in:

Zero-downtime migration strategies
Large-scale data migrations (billions of rows)
PostgreSQL optimization for financial systems
Emergency migration recovery

Introduction #

The Challenge #

Traditional migration approach:

bash

1# ❌ Don't do this in production
21. Stop application
32. Run migrations
43. Start application
5

Downtime: 5-30 minutes
Revenue impact: $50k -$ 500k per hour for large platforms
Regulatory risk: Trading halts trigger regulatory scrutiny

The Zero-Downtime Pattern #

Core principle: Deploy in phases, never break backward compatibility

plaintext

1Phase 1: Expand (add new schema)
2Phase 2: Migrate (copy data)
3Phase 3: Contract (remove old schema)
4

Each phase is a separate deployment with full backward compatibility.

Example: Adding a Column #

❌ Breaking Change #

sql

1-- This breaks running code immediately
2ALTER TABLE orders ADD COLUMN execution_venue VARCHAR(50) NOT NULL;
3

✅ Zero-Downtime Approach #

Phase 1: Add nullable column

sql

1-- Migration v1
2ALTER TABLE orders ADD COLUMN execution_venue VARCHAR(50) NULL;
3
4-- Old code continues working (ignores new column)
5-- New code can start using it
6

Phase 2: Backfill data

sql

1-- Migration v2 (run during low traffic)
2UPDATE orders 
3SET execution_venue = 'NYSE' 
4WHERE execution_venue IS NULL 
5  AND symbol IN (SELECT symbol FROM nyse_listings);
6
7UPDATE orders 
8SET execution_venue = 'NASDAQ' 
9WHERE execution_venue IS NULL;
10

Phase 3: Add constraint

sql

1-- Migration v3 (after all old code deployed)
2ALTER TABLE orders 
3ALTER COLUMN execution_venue SET NOT NULL;
4
5CREATE INDEX idx_orders_venue ON orders(execution_venue);
6

Renaming Columns #

Renaming is especially tricky - old and new code must coexist.

The View Approach #

Phase 1: Create view with both names

sql

1-- Migration v1
2ALTER TABLE trades ADD COLUMN execution_timestamp TIMESTAMPTZ;
3
4-- Update new column when old column is written
5CREATE OR REPLACE FUNCTION sync_execution_time()
6RETURNS TRIGGER AS $$
7BEGIN
8  IF NEW.execution_time IS NOT NULL THEN
9    NEW.execution_timestamp = NEW.execution_time;
10  END IF;
11  RETURN NEW;
12END;
13$$ LANGUAGE plpgsql;
14
15CREATE TRIGGER sync_execution_time_trigger
16BEFORE INSERT OR UPDATE ON trades
17FOR EACH ROW
18EXECUTE FUNCTION sync_execution_time();
19
20-- Backfill existing data
21UPDATE trades 
22SET execution_timestamp = execution_time 
23WHERE execution_timestamp IS NULL;
24

Phase 2: Update application code

typescript

1// Old code (still running)
2const trade = await db.query(
3  'INSERT INTO trades (symbol, quantity, execution_time) VALUES ($1, $2, $3)',
4  ['AAPL', 100, new Date()]
5);
6
7// New code (gradual rollout)
8const trade = await db.query(
9  'INSERT INTO trades (symbol, quantity, execution_timestamp) VALUES ($1, $2, $3)',
10  ['AAPL', 100, new Date()]
11);
12

Phase 3: Remove old column

sql

1-- Migration v2 (after all old code replaced)
2DROP TRIGGER sync_execution_time_trigger ON trades;
3DROP FUNCTION sync_execution_time();
4ALTER TABLE trades DROP COLUMN execution_time;
5

Changing Column Types #

Even more complex - requires data transformation.

Example: Numeric precision change #

Current: price DECIMAL(10,2)
Needed: price DECIMAL(18,8) (for crypto trading)

Phase 1: Shadow column

sql

1-- Migration v1
2ALTER TABLE orders ADD COLUMN price_precise DECIMAL(18,8);
3
4-- Dual write trigger
5CREATE OR REPLACE FUNCTION sync_price()
6RETURNS TRIGGER AS $$
7BEGIN
8  IF NEW.price IS NOT NULL THEN
9    NEW.price_precise = NEW.price::DECIMAL(18,8);
10  END IF;
11  IF NEW.price_precise IS NOT NULL THEN
12    NEW.price = NEW.price_precise::DECIMAL(10,2);
13  END IF;
14  RETURN NEW;
15END;
16$$ LANGUAGE plpgsql;
17
18CREATE TRIGGER sync_price_trigger
19BEFORE INSERT OR UPDATE ON orders
20FOR EACH ROW
21EXECUTE FUNCTION sync_price();
22
23-- Backfill
24UPDATE orders SET price_precise = price::DECIMAL(18,8);
25

Phase 2: Switch reads

typescript

1// Old code
2const order = await db.query('SELECT id, symbol, price FROM orders');
3
4// New code
5const order = await db.query(
6  'SELECT id, symbol, price_precise as price FROM orders'
7);
8

Phase 3: Swap columns

sql

1-- Migration v2
2DROP TRIGGER sync_price_trigger ON orders;
3DROP FUNCTION sync_price();
4
5ALTER TABLE orders DROP COLUMN price;
6ALTER TABLE orders RENAME COLUMN price_precise TO price;
7
8-- Recreate indexes
9CREATE INDEX idx_orders_price ON orders(price);
10

Index Management #

Indexes can lock tables for hours on large datasets.

Problem #

sql

1-- ❌ Locks table for 2 hours on 1B row table
2CREATE INDEX idx_trades_timestamp ON trades(execution_timestamp);
3

Solution: Concurrent indexes #

sql

1-- ✅ No table lock (builds in background)
2CREATE INDEX CONCURRENTLY idx_trades_timestamp 
3ON trades(execution_timestamp);
4
5-- Monitor progress
6SELECT 
7  schemaname,
8  tablename,
9  indexname,
10  pg_size_pretty(pg_relation_size(indexrelid)) as size
11FROM pg_stat_user_indexes
12WHERE indexrelname = 'idx_trades_timestamp';
13

Caveat: CONCURRENTLY requires more disk space and time but doesn't block writes.

Large Data Migrations #

Backfilling 1 billion rows requires careful batching.

❌ Wrong Approach #

sql

1-- Locks table, fills transaction log, runs for hours
2UPDATE trades 
3SET venue_id = (SELECT id FROM venues WHERE name = venue_name);
4

✅ Batched Approach #

sql

1-- Migration script
2DO $$
3DECLARE
4  batch_size INTEGER := 10000;
5  total_rows BIGINT;
6  processed BIGINT := 0;
7BEGIN
8  SELECT COUNT(*) INTO total_rows FROM trades WHERE venue_id IS NULL;
9  
10  WHILE processed < total_rows LOOP
11    UPDATE trades
12    SET venue_id = v.id
13    FROM venues v
14    WHERE trades.venue_name = v.name
15      AND trades.venue_id IS NULL
16      AND trades.id IN (
17        SELECT id FROM trades 
18        WHERE venue_id IS NULL 
19        LIMIT batch_size
20      );
21    
22    processed := processed + batch_size;
23    
24    -- Log progress
25    RAISE NOTICE 'Processed % / % rows (% %%)', 
26      processed, total_rows, (processed * 100 / total_rows);
27    
28    -- Prevent transaction log bloat
29    COMMIT;
30    
31    -- Throttle to avoid overwhelming database
32    PERFORM pg_sleep(0.1);
33  END LOOP;
34END $$;
35

Foreign Key Changes #

Adding foreign keys can lock tables.

Safe Approach #

sql

1-- Phase 1: Add constraint NOT VALID (doesn't scan table)
2ALTER TABLE orders 
3ADD CONSTRAINT fk_orders_accounts 
4FOREIGN KEY (account_id) REFERENCES accounts(id)
5NOT VALID;
6
7-- Phase 2: Validate in background (doesn't block writes)
8ALTER TABLE orders 
9VALIDATE CONSTRAINT fk_orders_accounts;
10

Schema Versioning #

Track migrations carefully:

sql

1CREATE TABLE schema_migrations (
2  version VARCHAR(50) PRIMARY KEY,
3  description TEXT,
4  applied_at TIMESTAMPTZ DEFAULT NOW(),
5  applied_by VARCHAR(100),
6  execution_time INTERVAL,
7  rollback_script TEXT
8);
9
10-- Record each migration
11INSERT INTO schema_migrations (version, description, rollback_script)
12VALUES (
13  '2024_11_01_001',
14  'Add execution_venue column to orders',
15  'ALTER TABLE orders DROP COLUMN execution_venue;'
16);
17

Testing Migrations #

1. Test with production data volume #

sql

1-- Create test table with same size
2CREATE TABLE trades_test AS 
3SELECT * FROM trades LIMIT 10000000;
4
5-- Test migration
6\timing on
7ALTER TABLE trades_test ADD COLUMN new_column VARCHAR(50);
8-- Output: Time: 45231.234 ms (45 seconds)
9

2. Test backward compatibility #

typescript

1// Simulation test
2describe('Migration compatibility', () => {
3  it('old code works with new schema', async () => {
4    // Apply new schema
5    await runMigration('2024_11_01_001');
6    
7    // Run old code
8    const result = await oldVersionCode.createOrder({
9      symbol: 'AAPL',
10      quantity: 100
11    });
12    
13    expect(result).toBeDefined();
14  });
15  
16  it('new code works with old schema', async () => {
17    // Don't apply migration
18    
19    // Run new code (should handle missing column)
20    const result = await newVersionCode.createOrder({
21      symbol: 'AAPL',
22      quantity: 100,
23      executionVenue: 'NYSE'
24    });
25    
26    expect(result).toBeDefined();
27  });
28});
29

Rollback Strategy #

Always have a rollback plan:

sql

1-- Migration up
2-- v2024_11_01_001_up.sql
3ALTER TABLE orders ADD COLUMN execution_venue VARCHAR(50);
4
5-- Migration down
6-- v2024_11_01_001_down.sql
7ALTER TABLE orders DROP COLUMN execution_venue;
8

For data migrations, store original values:

sql

1-- Before updating
2CREATE TABLE orders_backup_20241101 AS
3SELECT id, original_column FROM orders;
4
5-- If rollback needed
6UPDATE orders o
7SET original_column = b.original_column
8FROM orders_backup_20241101 b
9WHERE o.id = b.id;
10

Monitoring During Migration #

sql

1-- Check migration progress
2SELECT 
3  schemaname,
4  tablename,
5  n_tup_upd,
6  n_tup_hot_upd,
7  n_live_tup
8FROM pg_stat_user_tables
9WHERE tablename = 'orders';
10
11-- Monitor locks
12SELECT 
13  pid,
14  usename,
15  pg_blocking_pids(pid) as blocked_by,
16  query
17FROM pg_stat_activity
18WHERE datname = 'trading_db'
19  AND state = 'active';
20
21-- Check replication lag (if using replicas)
22SELECT 
23  client_addr,
24  state,
25  sync_state,
26  replay_lag
27FROM pg_stat_replication;
28

Real-World Example: Our Trading Platform #

Challenge: Migrate 800M trades to new schema with additional columns

Approach:

plaintext

1Day 1: Add nullable columns (5 minutes)
2Day 2-14: Backfill in batches (100k rows/minute, off-peak hours)
3Day 15: Add NOT NULL constraints (10 minutes)
4Day 16: Deploy new application code
5Day 30: Remove old columns (5 minutes)
6

Result:

Zero downtime
Zero data loss
No performance degradation
Completed in 30 days vs 6-hour maintenance window

Best Practices Checklist #

Conclusion #

Zero-downtime migrations are possible with careful planning:

Expand - add new schema elements
Migrate - gradually move data
Contract - remove old schema

The key is maintaining backward compatibility at every step. It requires more migrations and longer timelines, but eliminates business risk.

Need Help?#

We specialize in:

Zero-downtime migration strategies
Large-scale data migrations (billions of rows)
PostgreSQL optimization for financial systems
Emergency migration recovery

NordVarg Team

Join 1,000+ Engineers

Related Posts

NordVarg Team

Join 1,000+ Engineers

Related Posts