In 2021, a quantitative hedge fund was testing a new mean-reversion strategy. They wanted to optimize across 1,000 parameter combinations (different lookback periods, entry thresholds, stop-losses) to find the best configuration. On their 16-core server, the full parameter sweep took 18.5 hours. By the time results came back, the market had moved, and the optimal parameters might have changed.
They couldn't iterate fast enough. Research velocity was their bottleneck—not ideas, not data, but compute. A researcher would submit a backtest Friday afternoon and get results Monday morning. Three-day feedback loops killed productivity.
We built a distributed backtesting system using Ray, scaling to 320 CPUs across 20 cloud instances. The same 1,000-parameter sweep now completes in 52 minutes—a 21x speedup. Researchers iterate 20+ times per day instead of once every 3 days. Research velocity increased 60x.
But distributed backtesting isn't free. The infrastructure cost $1.20/hour ($28.80/day if running continuously). The engineering effort was 4 months. And debugging distributed systems is 10x harder than debugging single-threaded code. This article covers when distributed backtesting is worth it, how to build it, and what goes wrong.
Single-threaded backtesting hits limits quickly:
Testing a strategy with 4 parameters, each with 10 possible values, creates 10,000 combinations. If each backtest takes 10 seconds, that's 27.8 hours. Add walk-forward analysis (splitting data into 10 periods), and you're at 278 hours (11.5 days).
Example: Mean-reversion strategy parameters
Total combinations: 10 × 5 × 4 × 4 = 800
At 10 seconds per backtest: 800 × 10s = 8,000s = 2.2 hours (single-threaded)
With 20 workers: 8,000s / 20 = 400s = 6.7 minutes
The speedup is linear with worker count (assuming no overhead).
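The sizing arithmetic above is easy to script before committing to a sweep. Here's a minimal sketch; the specific value lists are assumptions chosen to match the 10 × 5 × 4 × 4 counts quoted above:

```python
import math

# Illustrative grid with 10 x 5 x 4 x 4 values per parameter (values assumed)
parameter_grid = {
    'lookback_period': list(range(10, 110, 10)),   # 10 values
    'entry_threshold': [1.0, 1.5, 2.0, 2.5, 3.0],  # 5 values
    'exit_threshold':  [0.25, 0.5, 0.75, 1.0],     # 4 values
    'stop_loss_z':     [2.0, 2.5, 3.0, 3.5],       # 4 values
}

n_combinations = math.prod(len(v) for v in parameter_grid.values())
seconds_per_backtest = 10

single_threaded_hours = n_combinations * seconds_per_backtest / 3600
twenty_worker_minutes = n_combinations * seconds_per_backtest / 20 / 60

print(n_combinations)                   # 800
print(round(single_threaded_hours, 1))  # 2.2
print(round(twenty_worker_minutes, 1))  # 6.7
```

Running this before a sweep tells you whether the job belongs on your laptop or on a cluster.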
Backtesting a portfolio of 100 stocks requires simulating all 100 simultaneously, tracking correlations, rebalancing, and risk management. This is 100x more expensive than backtesting a single stock.
For a universe of 500 stocks, testing 100 different portfolio construction methods takes weeks on a single machine. Distributed systems make it feasible.
Some strategies require Monte Carlo simulation: run the same strategy 1,000+ times with different random seeds to estimate distribution of outcomes. This is embarrassingly parallel—each simulation is independent.
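The pattern can be sketched with the standard library (concurrent.futures is used here instead of Ray so the example runs standalone; the return model and per-run seeding scheme are illustrative):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def simulate_path(seed: int, n_days: int = 252) -> float:
    """One Monte Carlo run: a year of daily returns drawn with its own seed."""
    rng = np.random.default_rng(seed)  # independent random stream per run
    daily = rng.normal(loc=0.0005, scale=0.01, size=n_days)
    return float(np.prod(1.0 + daily) - 1.0)  # compounded total return

def monte_carlo(n_runs: int, max_workers: int = 8) -> list:
    # Each seed is independent, so runs can be farmed out in any order.
    # With Ray, simulate_path would become a @ray.remote task instead.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(simulate_path, range(n_runs)))

outcomes = monte_carlo(1000)
print(f"median total return: {np.median(outcomes):.2%}")
```

Because every run is seeded deterministically, the same distribution comes back regardless of how many workers execute it — exactly the property that makes this workload embarrassingly parallel.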
We use Ray, a distributed computing framework that makes parallelization trivial. Ray handles task scheduling, worker fault tolerance, and a shared-memory object store that lets workers read large datasets without copying them.
Alternatives considered:
Ray wins because:
import ray
import pandas as pd
import numpy as np
from typing import Dict, List
from dataclasses import dataclass
import time


@dataclass
class BacktestConfig:
    """Configuration for a single backtest"""
    strategy_name: str
    parameters: Dict
    start_date: str
    end_date: str
    initial_capital: float = 1_000_000.0
    transaction_cost_bps: float = 5.0  # 5 basis points


@ray.remote
class BacktestWorker:
    """
    Ray actor that runs backtests.
    Each worker maintains its own data cache.
    """

    def __init__(self, data_ref):
        """
        Args:
            data_ref: Ray object reference to shared market data
        """
        # Load data from shared object store (zero-copy)
        self.market_data = ray.get(data_ref)
        print(f"Worker initialized with {len(self.market_data)} rows of data")

    def run_backtest(self, config: BacktestConfig) -> Dict:
        """
        Execute a single backtest.

        Returns:
            Dictionary with performance metrics
        """
        start_time = time.time()

        # Filter data for backtest period
        data = self.market_data[
            (self.market_data['date'] >= config.start_date) &
            (self.market_data['date'] <= config.end_date)
        ].copy()

        # Run strategy
        results = self._execute_strategy(data, config)

        # Calculate metrics
        metrics = self._calculate_metrics(results, config)
        metrics['backtest_time_seconds'] = time.time() - start_time
        metrics['parameters'] = config.parameters

        return metrics

    def _execute_strategy(self, data, config):
        """Execute trading strategy logic"""
        portfolio_value = [config.initial_capital]
        positions = []
        cash = config.initial_capital
        shares = 0

        params = config.parameters

        for i in range(len(data)):
            row = data.iloc[i]

            # Not enough history yet to compute indicators
            if i < params['lookback_period']:
                positions.append(0)
                portfolio_value.append(cash + shares * row['close'])
                continue

            # Mean reversion signal
            recent_prices = data.iloc[i - params['lookback_period']:i]['close']
            mean_price = recent_prices.mean()
            std_price = recent_prices.std()

            if std_price == 0:
                z_score = 0
            else:
                z_score = (row['close'] - mean_price) / std_price

            # Trading logic
            signal = 0
            if shares == 0:  # No position
                if z_score < -params['entry_threshold']:
                    signal = 1   # Buy
                elif z_score > params['entry_threshold']:
                    signal = -1  # Short
            else:  # Have position
                if abs(z_score) < params['exit_threshold']:
                    signal = -np.sign(shares)  # Close position
                elif shares > 0 and z_score < -params['stop_loss_z']:
                    signal = -1  # Stop loss on long
                elif shares < 0 and z_score > params['stop_loss_z']:
                    signal = 1   # Stop loss on short

            # Execute trade
            if signal != 0:
                if shares == 0:
                    # Open a new long or short position sized by available cash
                    trade_shares = signal * int(cash / row['close'])
                else:
                    # Any signal while holding a position closes it
                    trade_shares = -shares
                trade_cost = abs(trade_shares) * row['close'] * (config.transaction_cost_bps / 10000)

                cash -= trade_shares * row['close'] + trade_cost
                shares += trade_shares

            positions.append(shares)
            portfolio_value.append(cash + shares * row['close'])

        return {
            'portfolio_value': portfolio_value,
            'positions': positions,
            'final_value': portfolio_value[-1]
        }

    def _calculate_metrics(self, results, config):
        """Calculate performance metrics"""
        pv = pd.Series(results['portfolio_value'])
        returns = pv.pct_change().dropna()

        # Count trades (any change in position size)
        positions = results['positions']
        num_trades = sum(1 for i in range(1, len(positions))
                         if positions[i] != positions[i - 1])

        # Calculate metrics
        total_return = (results['final_value'] / config.initial_capital) - 1

        if len(returns) > 0 and returns.std() > 0:
            sharpe_ratio = returns.mean() / returns.std() * np.sqrt(252)
        else:
            sharpe_ratio = 0

        # Maximum drawdown
        cummax = pv.cummax()
        drawdown = (pv - cummax) / cummax
        max_drawdown = drawdown.min()

        return {
            'total_return': total_return,
            'sharpe_ratio': sharpe_ratio,
            'max_drawdown': max_drawdown,
            'num_trades': num_trades,
            'final_value': results['final_value']
        }


class DistributedBacktester:
    """
    Coordinator for distributed backtesting.
    """

    def __init__(self, n_workers: int = 20):
        """
        Args:
            n_workers: Number of parallel workers
        """
        # Initialize Ray
        if not ray.is_initialized():
            ray.init(ignore_reinit_error=True)

        print(f"Initializing {n_workers} workers...")

        # Load and share market data
        market_data = self._load_market_data()
        self.data_ref = ray.put(market_data)  # Store in object store

        # Create worker pool
        self.workers = [
            BacktestWorker.remote(self.data_ref)
            for _ in range(n_workers)
        ]

        print(f"Workers initialized. Data size: {len(market_data)} rows")

    def _load_market_data(self):
        """Load market data (in production: from database)"""
        # Simulated data for example
        dates = pd.date_range('2010-01-01', '2024-01-01', freq='D')
        data = pd.DataFrame({
            'date': dates,
            'close': 100 + np.cumsum(np.random.randn(len(dates)) * 0.5)
        })
        return data

    def run_parameter_sweep(self,
                            strategy_name: str,
                            parameter_grid: Dict[str, List],
                            start_date: str,
                            end_date: str) -> pd.DataFrame:
        """
        Run backtest across all parameter combinations.

        Args:
            strategy_name: Name of strategy
            parameter_grid: Dict mapping parameter names to lists of values
            start_date, end_date: Backtest period

        Returns:
            DataFrame with results sorted by Sharpe ratio
        """
        # Generate all parameter combinations
        from itertools import product

        param_names = list(parameter_grid.keys())
        param_values = list(parameter_grid.values())

        configs = []
        for values in product(*param_values):
            params = dict(zip(param_names, values))
            configs.append(BacktestConfig(
                strategy_name=strategy_name,
                parameters=params,
                start_date=start_date,
                end_date=end_date
            ))

        print(f"Running {len(configs)} backtests across {len(self.workers)} workers...")
        start_time = time.time()

        # Distribute work across workers (round-robin)
        tasks = []
        for i, config in enumerate(configs):
            worker = self.workers[i % len(self.workers)]
            task = worker.run_backtest.remote(config)
            tasks.append(task)

        # Collect results
        results = ray.get(tasks)

        elapsed = time.time() - start_time
        print(f"Completed in {elapsed:.1f} seconds ({len(configs)/elapsed:.1f} backtests/sec)")

        # Convert to DataFrame
        df = pd.DataFrame(results)
        return df.sort_values('sharpe_ratio', ascending=False)

    def shutdown(self):
        """Clean up Ray resources"""
        ray.shutdown()


# Example usage
if __name__ == "__main__":
    # Create distributed backtester
    backtester = DistributedBacktester(n_workers=20)

    # Define parameter grid
    parameter_grid = {
        'lookback_period': [20, 40, 60, 80, 100],
        'entry_threshold': [1.5, 2.0, 2.5],
        'exit_threshold': [0.25, 0.5, 0.75],
        'stop_loss_z': [2.5, 3.0, 3.5]
    }

    # Run parameter sweep
    results = backtester.run_parameter_sweep(
        strategy_name='mean_reversion',
        parameter_grid=parameter_grid,
        start_date='2015-01-01',
        end_date='2023-12-31'
    )

    print("\nTop 10 parameter combinations:")
    print(results.head(10)[['sharpe_ratio', 'total_return', 'max_drawdown', 'num_trades', 'parameters']])

    backtester.shutdown()
Key design decisions:
A $200M quantitative hedge fund runs 50+ strategies across multiple asset classes. Each strategy has 5-10 tunable parameters. They want to:
Scale: 50 strategies × 1,000 parameter combinations × 10 walk-forward periods = 500,000 backtests
Single-threaded time: 500,000 × 10 seconds = 5,000,000 seconds = 1,389 hours = 58 days
Clearly infeasible.
Infrastructure:
Results:
ROI:
Problem 1: Data transfer bottleneck
Initially, each worker loaded market data independently from S3. With 320 workers, this created 320 simultaneous S3 requests, overwhelming the bucket and causing throttling.
Solution: Load data once on the coordinator, store in Ray's object store, share references to workers. This reduced data transfer from 320× to 1×.
Problem 2: Straggler tasks
Some parameter combinations took 10x longer than others (e.g., high-frequency strategies with many trades). This created "stragglers"—the last 5% of tasks took 30% of total time.
Solution: Sort tasks by estimated runtime (based on parameter complexity), schedule longest tasks first. This improved load balancing and reduced total time by 20%.
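A sketch of that longest-first dispatch. The cost model below is hypothetical (lookback length and entry threshold standing in for measured runtimes); in practice you would fit it to observed backtest times:

```python
def estimated_cost(params: dict) -> float:
    # Hypothetical proxy: longer lookbacks and looser entry thresholds
    # tend to mean more indicator work and more trades
    return params['lookback_period'] / max(params['entry_threshold'], 0.1)

def longest_first(configs: list) -> list:
    """Dispatch expensive tasks first so no straggler starts near the end."""
    return sorted(configs, key=estimated_cost, reverse=True)

configs = [
    {'lookback_period': 20,  'entry_threshold': 2.0},
    {'lookback_period': 100, 'entry_threshold': 1.5},
    {'lookback_period': 60,  'entry_threshold': 2.5},
]
ordered = longest_first(configs)
print([c['lookback_period'] for c in ordered])  # [100, 60, 20]
```

The sort costs O(n log n) once, which is negligible next to the backtests themselves.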
Problem 3: Memory leaks
Workers accumulated memory over time, eventually crashing. Debugging showed pandas DataFrames weren't being garbage collected properly.
Solution: Restart workers every 100 backtests. Ray makes this trivial—just kill and respawn the actor.
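The recycling policy can be sketched without Ray. Below, `make_worker` stands in for something like `lambda: BacktestWorker.remote(data_ref)`, plain callables stand in for actors, and the commented `ray.kill` marks where the real actor teardown would go:

```python
class RecyclingPool:
    """Replace each worker after max_tasks jobs so leaked memory is reclaimed."""

    def __init__(self, make_worker, n_workers: int, max_tasks: int = 100):
        self.make_worker = make_worker
        self.max_tasks = max_tasks
        self.workers = [make_worker() for _ in range(n_workers)]
        self.task_counts = [0] * n_workers

    def submit(self, i: int, job):
        if self.task_counts[i] >= self.max_tasks:
            # With Ray: ray.kill(self.workers[i]) before respawning the actor
            self.workers[i] = self.make_worker()
            self.task_counts[i] = 0
        self.task_counts[i] += 1
        return self.workers[i](job)

# Toy workers: plain callables standing in for Ray actors
pool = RecyclingPool(lambda: (lambda job: job * 2), n_workers=2, max_tasks=3)
results = [pool.submit(0, j) for j in range(4)]
print(results)  # [0, 2, 4, 6]; the 4th job ran on a freshly respawned worker
```

Because each actor holds only a reference into the shared object store, respawning one is cheap: the market data never needs to be reloaded.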
Distributed systems add complexity. Sometimes, a faster single machine is better.
For 100 backtests at 10 seconds each, that's 16.7 minutes on a single machine. Setting up a distributed cluster takes longer than just running it locally.
Rule of thumb: Distribute only if single-threaded time >1 hour.
Distributed backtesting makes debugging 10x harder. Print statements don't work (output scattered across workers). Breakpoints don't work (can't attach debugger to remote workers).
For strategy development, use single-threaded backtesting. Only distribute for production parameter sweeps.
Running 320 CPUs costs $1.20/hour. For a 4-hour job, that's $4.80. If you run this daily, it's $1,752/year.
For a large fund, this is trivial. For an individual trader, it's significant. Consider whether the speed improvement justifies the cost.
We track:
Dashboards (Grafana) show real-time progress. Alerts trigger if throughput drops below 500 backtests/minute (indicates worker failures).
Workers fail. S3 throttles. Networks partition. The system must handle failures gracefully.
We implement:
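One typical building block is a retry-with-backoff wrapper around anything that touches the network. This is a generic sketch; in the real system the retried call would be a resubmitted Ray task, and the set of retryable exceptions is an assumption:

```python
import time

def with_retries(fn, max_attempts: int = 3, base_delay: float = 1.0,
                 retryable=(ConnectionError, TimeoutError)):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Example: a call that fails twice before succeeding
state = {'calls': 0}
def flaky_fetch():
    state['calls'] += 1
    if state['calls'] < 3:
        raise ConnectionError("transient S3 throttle")
    return "data"

print(with_retries(flaky_fetch, base_delay=0.0))  # data
```

Exponential backoff matters here: retrying 320 workers at a fixed interval would just re-trigger the same S3 throttling that caused the failure.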
We use AWS spot instances (70% cheaper than on-demand). Spot instances can be terminated with 2-minute notice, but Ray handles this gracefully—just reschedules tasks to other workers.
For non-urgent jobs, spot instances reduce cost from $1.20/hour to $0.36/hour.
Distributed backtesting transformed research velocity from days to minutes. Researchers iterate 60x faster, test more ideas, and find better strategies.
But it's not free. The infrastructure costs money, the engineering takes months, and debugging is harder. The decision depends on scale:
Distribute when: single-threaded sweeps take more than an hour, you run production parameter sweeps repeatedly, and compute cost is small relative to the value of faster research.
Don't distribute when: you're still developing and debugging a strategy, the job finishes in minutes on one machine, or the cloud bill matters more than the wait.
For the hedge fund, distributed backtesting was transformative. For an individual trader, a faster laptop might suffice. As always: measure, analyze, choose the right tool for your scale.
NordVarg Team is a software engineer at NordVarg specializing in high-performance financial systems and type-safe programming.