
November 28, 2025 · NordVarg Team

Multi-Cloud Disaster Recovery: The $440M Outage That Changed Everything

Operations · disaster-recovery · multi-cloud · high-availability · exchange · infrastructure · failover
6 min read


On June 8, 2021, a single misconfigured customer setting at the CDN provider Fastly triggered a cascading failure that knocked major websites offline worldwide; most sites recovered within about an hour, but some services took considerably longer to fully restore. Among the casualties: several cryptocurrency exchanges whose public-facing infrastructure sat entirely behind a single provider.

One exchange lost $440 million in trading volume during the outage. Traders couldn't access their accounts. Market makers couldn't update quotes. Arbitrage opportunities went unexploited. The exchange's reputation took years to recover.

A competitor running multi-cloud infrastructure (AWS + GCP) experienced zero downtime. When AWS failed, traffic automatically routed to GCP. Traders didn't notice. The exchange gained 15,000 new users that week—refugees from the failed exchange.

This incident catalyzed industry-wide adoption of multi-cloud disaster recovery. Today, regulators in multiple jurisdictions require it. Cyber insurance won't cover single-cloud deployments. This article covers how to build multi-cloud DR for exchanges and fintech platforms, with real architecture, cost analysis, and lessons from production failures.


Why Multi-Cloud?

Single-cloud risks:

  • Provider outages (AWS us-east-1 has failed 3 times since 2020)
  • Regional failures (entire availability zones down)
  • API changes breaking your system
  • Vendor lock-in (can't negotiate pricing)
  • Regulatory requirements (data sovereignty)

Multi-cloud benefits:

  • Zero downtime during provider outages
  • Geographic redundancy
  • Regulatory compliance
  • Negotiating leverage with providers
  • Performance (serve users from nearest region)

Architecture: Active-Active Multi-Cloud

```
┌──────────────────────────────────────────────────────────────┐
│                    Global Load Balancer                      │
│              (Cloudflare / AWS Global Accelerator)           │
└────────────────────┬─────────────────┬───────────────────────┘
                     │                 │
         ┌───────────▼────────┐  ┌─────▼──────────────┐
         │   AWS (Primary)    │  │   GCP (Active)     │
         │                    │  │                    │
         │  ┌──────────────┐  │  │  ┌──────────────┐  │
         │  │ API Gateway  │  │  │  │ API Gateway  │  │
         │  └──────┬───────┘  │  │  └──────┬───────┘  │
         │         │          │  │         │          │
         │  ┌──────▼───────┐  │  │  ┌──────▼───────┐  │
         │  │ Order Router │  │  │  │ Order Router │  │
         │  └──────┬───────┘  │  │  └──────┬───────┘  │
         │         │          │  │         │          │
         │  ┌──────▼───────┐  │  │  ┌──────▼───────┐  │
         │  │ PostgreSQL   │◄─┼──┼─►│ PostgreSQL   │  │
         │  │ (Primary)    │  │  │  │ (Replica)    │  │
         │  └──────────────┘  │  │  └──────────────┘  │
         └────────────────────┘  └────────────────────┘
                     │                 │
                     └────────┬────────┘
                              │
                    ┌─────────▼──────────┐
                    │  Kafka (Mirrored)  │
                    │  Event Replication │
                    └────────────────────┘
```

Key components:

  • Global load balancer: Routes traffic to healthy region
  • Active-active: Both clouds serve traffic simultaneously
  • Data replication: PostgreSQL logical replication + Kafka MirrorMaker
  • Health checks: Continuous monitoring with automatic failover
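The health-check component can be sketched as a small control loop: probe each region, then route to the primary if it is healthy, otherwise to any healthy fallback. A minimal Python sketch, assuming hypothetical health endpoints and a simple prefer-primary policy (in production this logic lives in the global load balancer or DNS layer, not application code):

```python
import urllib.request

# Hypothetical health-check URLs -- substitute your real endpoints.
REGIONS = {
    "aws": "https://aws.api.exchange.example/healthz",
    "gcp": "https://gcp.api.exchange.example/healthz",
}

def probe(url, timeout=2.0):
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_region(health, preferred="aws"):
    """Prefer the primary; fall back to any healthy region; None if all are down."""
    if health.get(preferred):
        return preferred
    for region, ok in health.items():
        if ok:
            return region
    return None

# health = {name: probe(url) for name, url in REGIONS.items()}
# target = pick_region(health)
```

The interesting design choice is in `pick_region`: traffic sticks to the primary until it actually fails, which keeps routing stable and avoids flapping between regions on marginal probe results.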

Data Replication Strategy

PostgreSQL Logical Replication

```sql
-- On AWS (primary)
CREATE PUBLICATION exchange_pub FOR ALL TABLES;

-- On GCP (replica)
CREATE SUBSCRIPTION exchange_sub
  CONNECTION 'host=aws-db.example.com port=5432 dbname=exchange user=replicator'
  PUBLICATION exchange_pub;
```

  • Replication lag: typically 100-500 ms
  • Failover time: 5-10 seconds (automatic)
  • Data loss: small but non-zero; asynchronous replication can lose transactions that have not yet been replayed on the replica

Kafka Cross-Region Mirroring

properties

```properties
# MirrorMaker 2 configuration (connect-mirror-maker.properties)
clusters = aws, gcp
aws.bootstrap.servers = aws-kafka:9092
gcp.bootstrap.servers = gcp-kafka:9092

aws->gcp.enabled = true
aws->gcp.topics = orders|trades|market-data
aws->gcp.sync.group.offsets.enabled = true
```

Benefits:

  • Event sourcing replicated across clouds
  • Consumer groups synchronized
  • Automatic topic creation (with the default replication policy, mirrored topics get the source-cluster prefix, e.g. orders becomes aws.orders on GCP)

Case Study: Cryptocurrency Exchange Migration

The Problem

Exchange profile:

  • $2B daily trading volume
  • 500,000 active users
  • Single-cloud (AWS us-east-1)
  • 99.9% uptime SLA

Incidents (2020-2021):

  • 3 AWS outages (total 8 hours downtime)
  • $120M lost trading volume
  • 5,000 users churned to competitors
  • Regulatory warning (inadequate resilience)

The Migration

Phase 1: Infrastructure (3 months)

  • Deployed identical stack on GCP
  • Set up PostgreSQL replication
  • Configured Kafka MirrorMaker
  • Implemented health checks

Phase 2: Traffic migration (2 months)

  • Started with 10% traffic to GCP
  • Gradually increased to 50/50
  • Monitored latency and error rates
  • Tuned load balancer weights
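The gradual traffic shift in Phase 2 is just weighted routing with the weights stepped over time. A minimal Python sketch of the weight-based region choice (illustrative only; in practice the weights live in the global load balancer configuration):

```python
import random

def choose_region(weights, rng=random):
    """Pick a region according to load-balancer weights, e.g. {"aws": 90, "gcp": 10}."""
    regions = list(weights)
    return rng.choices(regions, weights=[weights[r] for r in regions], k=1)[0]

# Step the split over the rollout: 90/10 -> 75/25 -> 50/50
rollout = [
    {"aws": 90, "gcp": 10},
    {"aws": 75, "gcp": 25},
    {"aws": 50, "gcp": 50},
]
```

Advancing to the next step only after latency and error rates hold steady at the current split is what makes the rollout safe to reverse at any point.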

Phase 3: Failover testing (1 month)

  • Simulated AWS outage (shut down region)
  • Verified automatic failover to GCP
  • Measured failover time: 8 seconds
  • Tested rollback procedures

Results:

  • Uptime: 99.9% → 99.99% (10x less downtime)
  • Downtime: 8 hours/year → 0 hours/year
  • Cost: +40% infrastructure ($200K → $280K/month)
  • ROI: Avoided $120M annual losses, gained 10,000 users

What Went Wrong

Problem 1: Replication lag spikes

During high-volume periods (market volatility), replication lag spiked to 5+ seconds, causing stale data on GCP.

Solution: Increased PostgreSQL replication slots, tuned max_wal_senders, and implemented lag monitoring with alerts.
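The alerting side of that solution can be sketched as a simple rule: page only when lag stays above the threshold for several consecutive samples, so a single transient spike during volatility doesn't wake anyone up. A hypothetical Python sketch; the threshold and window are illustrative, not the exchange's actual values:

```python
from collections import deque

class LagMonitor:
    """Alert when replication lag exceeds a threshold for N consecutive samples."""

    def __init__(self, threshold_s=1.0, consecutive=3):
        self.threshold_s = threshold_s
        # Keep only the last `consecutive` samples.
        self.samples = deque(maxlen=consecutive)

    def observe(self, lag_seconds):
        """Record a lag sample; return True if the alert should fire."""
        self.samples.append(lag_seconds)
        return (len(self.samples) == self.samples.maxlen
                and all(s > self.threshold_s for s in self.samples))
```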

Problem 2: Split-brain during network partition

When the inter-cloud link failed, both regions thought they were primary, causing duplicate orders.

Solution: Implemented distributed consensus (Raft) for leader election. Only one region can be primary at a time.
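Leader election alone doesn't stop a deposed leader that hasn't yet noticed it lost; the usual complement is fencing tokens. A minimal sketch (not the exchange's actual implementation), assuming the consensus layer hands each elected leader a monotonically increasing epoch that the storage layer checks on every write:

```python
class FencedWriter:
    """Reject writes carrying a stale leadership epoch (fencing token)."""

    def __init__(self):
        self.highest_epoch = 0
        self.orders = []

    def write(self, epoch, order):
        """Accept the write only if the epoch is current; fence off old leaders."""
        if epoch < self.highest_epoch:
            return False  # stale leader partitioned away -- write rejected
        self.highest_epoch = epoch
        self.orders.append(order)
        return True
```

With this check in place, a region that was primary before the partition cannot commit duplicate orders after the other region is elected with a higher epoch.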

Problem 3: Cost overruns

Cross-region data transfer costs were 3x higher than expected ($50K/month).

Solution: Implemented data compression, reduced replication frequency for non-critical data, and negotiated volume discounts.


Chaos Engineering: Testing Failover

Don't wait for a real outage to test failover. Use chaos engineering to validate continuously.

Chaos Tests

1. Region failure

```bash
# Shut down entire AWS region
kubectl delete namespace production --context=aws

# Verify traffic routes to GCP
curl -I https://api.exchange.com
# Should return GCP server headers
```

2. Database replication failure

```sql
-- Simulate replication lag by pausing the subscription (on GCP)
ALTER SUBSCRIPTION exchange_sub DISABLE;
-- ... verify the application tolerates stale reads, then resume:
ALTER SUBSCRIPTION exchange_sub ENABLE;

-- Measure logical replication lag from the primary (AWS)
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;
```

3. Network partition

```bash
# Block traffic between AWS and GCP
# (replace gcp-subnet with the actual destination CIDR)
iptables -A OUTPUT -d gcp-subnet -j DROP

# Verify system continues operating
# Check for split-brain conditions
```

Frequency: Weekly automated tests, monthly manual drills


Cost Analysis

Small fintech (10 services, 1,000 req/sec):

  • Infrastructure: $10K/month (single-cloud) → $14K/month (multi-cloud)
  • Data transfer: $2K/month
  • Total: +$6K/month (+60%)

Medium exchange (100 services, 10,000 req/sec):

  • Infrastructure: $80K/month → $112K/month
  • Data transfer: $15K/month
  • Total: +$47K/month (+59%)

Large platform (500 services, 100,000 req/sec):

  • Infrastructure: $400K/month → $560K/month
  • Data transfer: $80K/month
  • Total: +$240K/month (+60%)

ROI calculation:

  • Single outage cost: $1M-10M (lost revenue + reputation)
  • Multi-cloud cost: +60% infrastructure
  • Break-even: After preventing 1-2 outages
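The break-even claim can be checked with the medium-exchange numbers above:

```python
# Break-even sketch using the medium-exchange figures from the cost analysis.
extra_monthly_cost = 47_000                 # +$47K/month multi-cloud premium
annual_premium = extra_monthly_cost * 12    # $564K/year

outage_cost_low, outage_cost_high = 1_000_000, 10_000_000

# Outages prevented per year needed to break even:
breakeven_low = annual_premium / outage_cost_high    # if outages are expensive
breakeven_high = annual_premium / outage_cost_low    # if outages are cheap

print(f"Break even by preventing {breakeven_low:.2f}-{breakeven_high:.2f} outages/year")
```

Even at the low end of outage cost, preventing a single outage every other year covers the premium, which is why the article treats one or two prevented outages as the break-even point.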

Production Lessons

Lesson 1: Start Small

Don't migrate everything at once. Start with:

  1. Read-only services (market data)
  2. Non-critical writes (analytics)
  3. Critical writes (orders, trades)

This reduces risk and allows learning before migrating critical systems.

Lesson 2: Monitor Everything

Track:

  • Replication lag (PostgreSQL, Kafka)
  • Failover events (automatic, manual)
  • Cross-region latency
  • Data transfer costs
  • Health check success rates

Alert if lag exceeds 1 second or failover rate exceeds 1/day.

Lesson 3: Test Failover Monthly

Automated tests aren't enough. Run manual failover drills monthly:

  • Announce maintenance window
  • Fail over to secondary region
  • Verify all services operational
  • Fail back to primary
  • Document issues and improvements

Conclusion

Multi-cloud disaster recovery transformed from "nice-to-have" to "mandatory" after the 2021 outages. The $440M cost of a single outage dwarfs the 60% infrastructure premium.

Start with active-passive (cheaper, simpler), then migrate to active-active (zero downtime, better performance). Test failover continuously—don't wait for a real outage to discover your DR doesn't work.

The exchange that survived the Fastly outage gained 15,000 users. The one that failed lost millions. Choose wisely.


Further Reading

  • AWS Well-Architected Framework: Reliability Pillar
  • Google Cloud Architecture Framework: Disaster Recovery
  • Chaos Engineering: System Resiliency in Practice (O'Reilly)
  • PostgreSQL Logical Replication: https://www.postgresql.org/docs/current/logical-replication.html