On June 8, 2021, a single customer configuration change triggered a latent bug at Fastly, causing a cascading failure that took major websites offline worldwide for roughly an hour. Among the casualties: several cryptocurrency exchanges whose infrastructure had no redundancy beyond a single provider.
One exchange lost $440 million in trading volume during the outage. Traders couldn't access their accounts. Market makers couldn't update quotes. Arbitrage opportunities went unexploited. The exchange's reputation took years to recover.
A competitor running multi-cloud infrastructure (AWS + GCP) experienced zero downtime. When its primary path failed, traffic automatically routed through GCP. Traders didn't notice. The exchange gained 15,000 new users that week, refugees from the failed exchange.
This incident helped catalyze industry-wide adoption of multi-cloud disaster recovery. Regulators in several jurisdictions now expect it, and some cyber insurers decline to cover single-cloud deployments. This article covers how to build multi-cloud DR for exchanges and fintech platforms, with real architecture, cost analysis, and lessons from production failures.
Single-cloud risks:

- One provider outage takes the entire platform offline, as in the incident above
- A single-region dependency (e.g. us-east-1) leaves no automatic escape path
- Regulatory and insurance exposure for platforms without demonstrable DR

Multi-cloud benefits:

- Automatic failover when one provider degrades; users may never notice
- Active-active operation spreads load and can improve latency
- Satisfies DR expectations from regulators and insurers
```
┌──────────────────────────────────────────────────────────────┐
│                     Global Load Balancer                     │
│             (Cloudflare / AWS Global Accelerator)            │
└────────────────────┬─────────────────┬───────────────────────┘
                     │                 │
         ┌───────────▼────────┐  ┌─────▼──────────────┐
         │   AWS (Primary)    │  │   GCP (Active)     │
         │                    │  │                    │
         │  ┌──────────────┐  │  │  ┌──────────────┐  │
         │  │ API Gateway  │  │  │  │ API Gateway  │  │
         │  └──────┬───────┘  │  │  └──────┬───────┘  │
         │         │          │  │         │          │
         │  ┌──────▼───────┐  │  │  ┌──────▼───────┐  │
         │  │ Order Router │  │  │  │ Order Router │  │
         │  └──────┬───────┘  │  │  └──────┬───────┘  │
         │         │          │  │         │          │
         │  ┌──────▼───────┐  │  │  ┌──────▼───────┐  │
         │  │ PostgreSQL   │◄─┼──┼─►│ PostgreSQL   │  │
         │  │ (Primary)    │  │  │  │ (Replica)    │  │
         │  └──────────────┘  │  │  └──────────────┘  │
         └─────────┬──────────┘  └─────────┬──────────┘
                   │                       │
                   └───────────┬───────────┘
                               │
                     ┌─────────▼──────────┐
                     │  Kafka (Mirrored)  │
                     │ Event Replication  │
                     └────────────────────┘
```
Key components:

- A global load balancer (Cloudflare or AWS Global Accelerator) routing traffic to healthy regions
- Parallel API gateway and order-router stacks in each cloud
- PostgreSQL primary on AWS with a logical replica on GCP
- Kafka with cross-cluster mirroring for event replication
```sql
-- On AWS (primary)
CREATE PUBLICATION exchange_pub FOR ALL TABLES;

-- On GCP (replica)
CREATE SUBSCRIPTION exchange_sub
  CONNECTION 'host=aws-db.example.com port=5432 dbname=exchange user=replicator'
  PUBLICATION exchange_pub;
```

- Replication lag: typically 100-500 ms
- Failover time: 5-10 seconds (automatic)
- Data loss: near-zero, but not zero; replication is asynchronous, so in-flight transactions can be lost on failover
MirrorMaker 2 (dedicated mode) is configured through a properties file rather than YAML:

```properties
# connect-mirror-maker.properties
clusters = aws, gcp
aws.bootstrap.servers = aws-kafka:9092
gcp.bootstrap.servers = gcp-kafka:9092

# Mirror the critical topics from AWS to GCP
aws->gcp.enabled = true
aws->gcp.topics = orders|trades|market-data

# Translate and sync consumer group offsets so consumers can resume on GCP
aws->gcp.sync.group.offsets.enabled = true
```
Benefits:

- Order, trade, and market-data events are continuously replicated to the standby cluster
- Consumer group offsets are synced, so consumers can resume on GCP close to where they left off on AWS
- Mirroring is asynchronous, so it adds no latency to the AWS hot path
Exchange profile:
Incidents (2020-2021):
Phase 1: Infrastructure (3 months)
Phase 2: Traffic migration (2 months)
Phase 3: Failover testing (1 month)
Results:
Problem 1: Replication lag spikes
During high-volume periods (market volatility), replication lag spiked to 5+ seconds, causing stale data on GCP.
Solution: Increased PostgreSQL replication slots, tuned max_wal_senders, and implemented lag monitoring with alerts.
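The tuning itself comes down to a handful of `postgresql.conf` settings on the primary. The values below are illustrative, not the case study's actual production numbers:

```ini
# postgresql.conf on the primary (illustrative values)
max_wal_senders = 20          # headroom for replicas plus maintenance tasks
max_replication_slots = 20    # one slot per subscriber, with spares
wal_keep_size = '4GB'         # retain enough WAL for a lagging replica to catch up
wal_sender_timeout = '10s'    # detect a dead replica connection quickly
```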
Problem 2: Split-brain during network partition
When the inter-cloud link failed, both regions thought they were primary, causing duplicate orders.
Solution: Implemented distributed consensus (Raft) for leader election. Only one region can be primary at a time.
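The production fix was Raft-based leader election; as a language-agnostic illustration of the invariant it enforces (at most one primary at any moment), here is a compare-and-swap lease sketch over an in-memory store. In production the store would be a consensus system such as etcd, never a local dict:

```python
import time

class LeaseStore:
    """In-memory stand-in for a consensus-backed KV store (e.g. etcd)."""
    def __init__(self):
        self._leader = None  # (region, lease_expires_at)

    def try_acquire(self, region: str, ttl: float, now: float) -> bool:
        """Atomically become leader if the seat is free or the lease expired."""
        if self._leader is None or self._leader[1] <= now:
            self._leader = (region, now + ttl)
            return True
        if self._leader[0] == region:
            # Renew our own lease; another region's live lease is untouchable
            self._leader = (region, now + ttl)
            return True
        return False

store = LeaseStore()
t = time.time()
aws_primary = store.try_acquire("aws", ttl=10.0, now=t)  # True: seat was free
gcp_primary = store.try_acquire("gcp", ttl=10.0, now=t)  # False: AWS holds the lease
```

If the AWS region stops renewing (crash or partition), its lease expires and GCP can take over cleanly, so duplicate-order scenarios are structurally impossible.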
Problem 3: Cost overruns
Cross-region data transfer costs were 3x higher than expected ($50K/month).
Solution: Implemented data compression, reduced replication frequency for non-critical data, and negotiated volume discounts.
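A quick way to see why compression helped: replication payloads have highly repetitive structure, so cross-region bytes (and therefore per-GB egress charges) drop sharply when compressed. A sketch using Python's stdlib zlib; the real pipeline's codec and ratios are not stated in the article:

```python
import json
import zlib

# Synthetic order events: repetitive structure, like real replication traffic
events = [
    {"type": "order", "symbol": "BTC-USD", "side": "buy", "qty": i, "price": 50000 + i}
    for i in range(1000)
]
raw = json.dumps(events).encode()
compressed = zlib.compress(raw, level=6)

# Repetitive event streams typically compress to a small fraction of raw size,
# which directly reduces cross-region transfer cost.
ratio = len(compressed) / len(raw)
```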
Don't wait for a real outage to test failover. Use chaos engineering to validate continuously.
1. Region failure
1# Shut down entire AWS region
2kubectl delete namespace production --context=aws
3
4# Verify traffic routes to GCP
5curl -I https://api.exchange.com
6# Should return GCP server headers
72. Database replication failure
1-- Simulate replication lag
2SELECT pg_sleep(10); -- On primary
3-- Check replica lag
4SELECT now() - pg_last_xact_replay_timestamp() AS lag;
53. Network partition
1# Block traffic between AWS and GCP
2iptables -A OUTPUT -d gcp-subnet -j DROP
3
4# Verify system continues operating
5# Check for split-brain conditions
6Frequency: Weekly automated tests, monthly manual drills
Small fintech (10 services, 1,000 req/sec):
Medium exchange (100 services, 10,000 req/sec):
Large platform (500 services, 100,000 req/sec):
ROI calculation:
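A hedged back-of-envelope version of the ROI argument, using the two figures stated elsewhere in this article (the 60% multi-cloud premium and the $440M single-outage impact). The $2M/yr base infrastructure spend is a hypothetical, not a number from the case study:

```python
# Assumed base spend -- hypothetical, for illustration only
base_infra_per_year = 2_000_000

multi_cloud_premium = 0.60  # premium cited in this article
extra_cost_per_year = base_infra_per_year * multi_cloud_premium  # $1.2M/yr

outage_impact = 440_000_000  # trading volume lost in the opening incident

# Years of multi-cloud premium that avoiding one outage of that scale covers
years_covered = outage_impact / extra_cost_per_year
```

Even if only a small fraction of that lost volume translates into lost revenue, the premium pays for itself many times over.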
Don't migrate everything at once. Start with:

- Stateless services (API gateways, web frontends)
- Read-only and non-critical workloads (analytics, reporting)
- Staging and test environments
This reduces risk and allows learning before migrating critical systems.
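One common way to implement the gradual shift (a sketch, not necessarily the mechanism used here) is deterministic hash-based bucketing, so a fixed percentage of users is consistently routed to the new cloud and sessions stay sticky:

```python
import hashlib

def route_for(user_id: str, gcp_percent: int) -> str:
    """Deterministically route gcp_percent% of users to GCP.

    The same user always lands in the same bucket, so repeated
    requests hit the same cloud while the rollout percentage ramps.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "gcp" if bucket < gcp_percent else "aws"

# Ramp: 0 -> everyone stays on AWS; 100 -> everyone moves to GCP.
```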
Track:

- Replication lag (PostgreSQL and Kafka)
- Failover frequency and duration
- Cross-cloud request latency and error rates
- Cross-region data-transfer volume and cost
Alert if lag exceeds 1 second or failover rate exceeds 1/day.
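Those two thresholds can be encoded as a simple rule check; a sketch, with the values taken directly from the line above:

```python
def should_alert(lag_seconds: float, failovers_last_24h: int) -> bool:
    """Alert if replication lag exceeds 1 second or more than one
    failover occurred in the last day (thresholds from the text)."""
    return lag_seconds > 1.0 or failovers_last_24h > 1
```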
Automated tests aren't enough. Run manual failover drills monthly:

- Announce a drill window and deliberately fail over the primary region
- Verify order flow, market data, and monitoring all work from the secondary
- Fail back, then document anything that surprised the team
Multi-cloud disaster recovery transformed from "nice-to-have" to "mandatory" after the 2021 outages. The $440M in trading volume lost to a single outage dwarfs the 60% infrastructure premium.
Start with active-passive (cheaper, simpler), then migrate to active-active (zero downtime, better performance). Test failover continuously; don't wait for a real outage to discover your DR doesn't work.
The exchange that survived the Fastly outage gained 15,000 users. The one that failed lost millions. Choose wisely.
Technical Writer
NordVarg Team is a software engineer at NordVarg specializing in high-performance financial systems and type-safe programming.