In 2022, a wealth management firm was running daily portfolio rebalancing for 5,000 clients. Each client had a custom portfolio of 500-1,000 stocks, optimized for their risk tolerance and tax situation. The optimization ran overnight on a 64-core CPU server, taking 8-10 hours to complete. This meant portfolio managers couldn't react to market events—by the time optimization finished, markets had moved.
We migrated the system to GPUs. The same 5,000 portfolios now optimize in 12 minutes. Intraday rebalancing became possible. The firm could respond to market volatility in real-time, improving client returns by 40 basis points annually. For a $2B AUM firm, that's $8M in additional value per year.
But here's the catch: the GPU implementation took 6 months to build, cost $400K in engineering time, and requires $50K annually in GPU infrastructure. For a smaller firm with 500 clients, the ROI wouldn't justify the cost. GPU acceleration is powerful, but it's not always the right answer.
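The arithmetic behind these figures is easy to check. The sketch below is a back-of-the-envelope model only: the function names are illustrative, and it assumes the 40 basis point uplift applies to the full $2B and that the $400K build cost is a one-time expense.

```python
def annual_benefit(aum_usd: float, uplift_bps: float) -> float:
    """Extra value per year from a return uplift given in basis points."""
    return aum_usd * uplift_bps / 10_000


def payback_years(build_cost: float, annual_infra: float, benefit: float) -> float:
    """Years needed to recover a one-time build cost from the net annual benefit."""
    net = benefit - annual_infra
    if net <= 0:
        return float("inf")  # the project never pays for itself
    return build_cost / net


# Figures quoted in the article: $2B AUM, 40 bps, $400K build, $50K/year infrastructure
benefit = annual_benefit(2_000_000_000, 40)
years = payback_years(400_000, 50_000, benefit)

print(f"Annual benefit: ${benefit:,.0f}")
print(f"Payback period: {years:.3f} years")
```

The same functions return an infinite payback period whenever the annual benefit falls below the infrastructure cost, which is the situation a much smaller firm can end up in.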
This article covers when GPU optimization makes sense, how to implement it with PyTorch, and—critically—when to stick with CPU-based solvers. We'll discuss the complete architecture, from mathematical formulation to production deployment, with real performance benchmarks and cost analysis.
Portfolio optimization solves a fundamental question: given a universe of assets, how should you allocate capital to maximize return while controlling risk?
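The classic formulation is Markowitz mean-variance optimization, shown here in the long-only, fully invested form that the implementation later in this article uses:

$$\max_{w} \; \mu^\top w - \lambda \, w^\top \Sigma w$$

Subject to:

$$\sum_{i=1}^{n} w_i = 1, \qquad w_i \ge 0$$

Where:

- $w$ is the vector of portfolio weights
- $\mu$ is the vector of expected returns
- $\Sigma$ is the covariance matrix of asset returns
- $\lambda$ is the risk-aversion parameter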
For a portfolio of assets:
For 1,000 assets, a single portfolio optimization takes ~100ms on CPU. For 5,000 portfolios, that's 500 seconds (8+ minutes). Add transaction costs, tax considerations, and sector constraints, and you're at hours.
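GPUs excel at parallel matrix operations. A single NVIDIA A100 has 6,912 CUDA cores, and together they execute the many multiply-add operations inside a matrix multiplication in parallel. Amortized over a large batch, work that takes about 100ms per portfolio on a single CPU core drops to about 2ms per portfolio on a GPU, a 50x speedup.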
But the speedup only applies if you can parallelize the work. Optimizing a single portfolio doesn't benefit much from GPUs. Optimizing 1,000 portfolios simultaneously does.
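To make the batching point concrete, here is a small NumPy sketch (CPU-only, with made-up random inputs) that evaluates the variance term $w^\top \Sigma w$ for many portfolios, first in a Python loop and then as a single batched operation. The batched form is the kind of computation that maps well onto a GPU. The function names are illustrative.

```python
import numpy as np


def variances_loop(w, Sigma):
    """Portfolio variance w^T Sigma w, computed one portfolio at a time."""
    return np.array([w[i] @ Sigma[i] @ w[i] for i in range(w.shape[0])])


def variances_batched(w, Sigma):
    """The same quantity for every portfolio in one batched contraction."""
    return np.einsum("bi,bij,bj->b", w, Sigma, w)


rng = np.random.default_rng(0)
n_portfolios, n_assets = 64, 20

# Random long-only weights that sum to 1 in each row
w = rng.random((n_portfolios, n_assets))
w /= w.sum(axis=1, keepdims=True)

# A @ A^T is positive semidefinite, so it is a valid covariance matrix
A = rng.standard_normal((n_portfolios, n_assets, n_assets))
Sigma = A @ A.transpose(0, 2, 1)

loop_result = variances_loop(w, Sigma)
batched_result = variances_batched(w, Sigma)
print("Results match:", np.allclose(loop_result, batched_result))
```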
Let's start with real benchmarks to set expectations.
| Metric | CPU (CVXPY) | GPU (PyTorch) | Speedup |
|---|---|---|---|
| Single portfolio | 95ms | 8ms | 12x |
| 100 portfolios | 9.5s | 0.4s | 24x |
| 1,000 portfolios | 95s | 2.1s | 45x |
| 10,000 portfolios | 950s (16min) | 18s | 53x |
Key insights:
The benchmark above assumes data is already on the GPU. In reality, you need to transfer data from CPU to GPU, which takes time:
For small batches, transfer time dominates. For large batches, it's negligible. This is why GPU optimization only makes sense at scale.
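Traditional portfolio optimization tools (CVXPY as a modeling layer, backed by solvers such as MOSEK or Gurobi) treat the problem as a quadratic program. These solvers are exact but slow at this scale. The PyTorch approach uses gradient descent instead: approximate, but fast and GPU-friendly.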
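As a rough illustration of the gradient-descent idea, the NumPy sketch below minimizes $\lambda w^\top \Sigma w - \mu^\top w$ for a single three-asset portfolio, using a softmax so that the weights stay non-negative and sum to 1. The inputs, step size, and iteration count are made-up values, and plain gradient descent stands in for whatever optimizer a real system would use.

```python
import numpy as np


def softmax(z):
    """Map unconstrained logits to non-negative weights that sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()


def loss(w, mu, Sigma, lam):
    """Negative mean-variance utility: lam * w^T Sigma w - mu^T w."""
    return lam * w @ Sigma @ w - mu @ w


def solve(mu, Sigma, lam, lr=0.5, n_iter=2000):
    z = np.zeros_like(mu)  # equal logits, so the starting weights are equal
    for _ in range(n_iter):
        w = softmax(z)
        g_w = 2.0 * lam * Sigma @ w - mu   # gradient of the loss w.r.t. the weights
        g_z = w * (g_w - w @ g_w)          # chain rule through the softmax
        z -= lr * g_z
    return softmax(z)


# Hypothetical inputs for illustration only
mu = np.array([0.08, 0.05, 0.03])
Sigma = np.array([[0.040, 0.006, 0.002],
                  [0.006, 0.020, 0.001],
                  [0.002, 0.001, 0.010]])
lam = 2.0

w_start = np.full(3, 1.0 / 3.0)
w_opt = solve(mu, Sigma, lam)
print("Weights:", np.round(w_opt, 4))
print("Loss before:", loss(w_start, mu, Sigma, lam), "after:", loss(w_opt, mu, Sigma, lam))
```

The softmax removes the need for explicit constraint handling, at the cost of giving up the optimality certificate that a quadratic programming solver provides.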
```python
import numpy as np
import torch
import torch.nn as nn


class PortfolioOptimizer(nn.Module):
    """GPU-accelerated portfolio optimizer using PyTorch."""

    def __init__(self, n_assets, n_portfolios, device='cuda'):
        super().__init__()
        self.n_assets = n_assets
        self.n_portfolios = n_portfolios
        self.device = device

        # Learnable logits; softmax turns them into portfolio weights.
        # Equal logits give equal starting weights of 1 / n_assets.
        # Shape: (n_portfolios, n_assets)
        self.weights = nn.Parameter(
            torch.ones(n_portfolios, n_assets, device=device) / n_assets
        )

    def forward(self, mu, Sigma, risk_aversion):
        """
        Calculate portfolio loss (negative utility).

        Args:
            mu: Expected returns, shape (n_portfolios, n_assets)
            Sigma: Covariance matrices, shape (n_portfolios, n_assets, n_assets)
            risk_aversion: Risk aversion parameter, scalar

        Returns:
            loss: Portfolio loss (to minimize)
        """
        # Softmax ensures the weights are non-negative and sum to 1
        w = torch.softmax(self.weights, dim=1)

        # Portfolio variance w^T Sigma w, batched:
        # (B, 1, n) @ (B, n, n) @ (B, n, 1) -> (B, 1, 1)
        w_expanded = w.unsqueeze(1)  # (n_portfolios, 1, n_assets)
        variance = torch.bmm(
            torch.bmm(w_expanded, Sigma),
            w.unsqueeze(2)
        ).squeeze(-1).squeeze(-1)  # (n_portfolios,), also when n_portfolios == 1

        # Expected return: mu^T w
        expected_return = (mu * w).sum(dim=1)  # (n_portfolios,)

        # Utility is return - risk_aversion * variance.
        # Minimizing the negative utility maximizes the utility.
        loss = risk_aversion * variance - expected_return

        return loss.mean()  # Average loss across all portfolios

    def optimize(self, mu, Sigma, risk_aversion, n_iterations=1000, lr=0.01):
        """
        Optimize portfolio weights using the Adam optimizer.

        Args:
            mu: Expected returns (CPU numpy array)
            Sigma: Covariance matrices (CPU numpy array)
            risk_aversion: Risk aversion parameter
            n_iterations: Number of optimization steps
            lr: Learning rate

        Returns:
            optimized_weights: Portfolio weights (CPU numpy array)
        """
        # Transfer data to the target device
        mu_gpu = torch.tensor(mu, dtype=torch.float32, device=self.device)
        Sigma_gpu = torch.tensor(Sigma, dtype=torch.float32, device=self.device)

        optimizer = torch.optim.Adam([self.weights], lr=lr)

        # Optimization loop
        for iteration in range(n_iterations):
            optimizer.zero_grad()
            loss = self.forward(mu_gpu, Sigma_gpu, risk_aversion)
            loss.backward()
            optimizer.step()

            # Optional: print progress
            if iteration % 100 == 0:
                print(f"Iteration {iteration}, Loss: {loss.item():.6f}")

        # Extract optimized weights
        with torch.no_grad():
            optimized_weights = torch.softmax(self.weights, dim=1)

        # Transfer back to CPU
        return optimized_weights.cpu().numpy()


# Usage example
n_assets = 500
n_portfolios = 1000

# Fall back to CPU so the example still runs on a machine without a GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
optimizer = PortfolioOptimizer(n_assets, n_portfolios, device=device)

# Generate sample data (in practice, load from database)
rng = np.random.default_rng(0)
mu = rng.standard_normal((n_portfolios, n_assets)) * 0.001  # Expected returns

# A @ A^T is symmetric positive semidefinite; the small diagonal term makes
# it strictly positive definite. (Symmetrizing random noise and adding
# 0.1 * I does not guarantee this at n_assets = 500.)
A = rng.standard_normal((n_portfolios, n_assets, n_assets), dtype=np.float32) * 0.01
Sigma = A @ A.transpose(0, 2, 1) + np.eye(n_assets, dtype=np.float32) * 1e-4

# Optimize
weights = optimizer.optimize(mu, Sigma, risk_aversion=0.5, n_iterations=500)

print(f"Optimized {n_portfolios} portfolios")
print(f"Weights shape: {weights.shape}")
print(f"Weights sum (should be ~1.0): {weights.sum(axis=1).mean():.4f}")
```

Key techniques:

- `torch.bmm` handles multiple portfolios simultaneously

Let's revisit the wealth management firm from the introduction with specific details.
Firm profile:
Old system (CPU):
Limitations:
New system:
Implementation details:
Results:
Problem 1: Numerical instability
Some covariance matrices were ill-conditioned, causing gradient descent to diverge. Weights would explode to infinity or collapse to zero.
Solution: Add regularization to covariance matrix:
```python
Sigma_regularized = Sigma + torch.eye(n_assets, device='cuda') * 1e-4
```
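To see why such a small diagonal shift helps, the NumPy sketch below (CPU-only, with a made-up covariance matrix for two almost perfectly correlated assets) compares the condition number before and after the same `1e-4` shift. The `regularize` helper is an illustrative name, not part of the original system.

```python
import numpy as np


def regularize(Sigma, eps=1e-4):
    """Add eps to the diagonal of every covariance matrix in the batch."""
    n = Sigma.shape[-1]
    return Sigma + eps * np.eye(n)


# Hypothetical example: two assets with 20% volatility and correlation 0.999999
vol, rho = 0.2, 0.999999
Sigma = np.array([[[vol**2, rho * vol**2],
                   [rho * vol**2, vol**2]]])   # shape (1, 2, 2)

cond_before = np.linalg.cond(Sigma[0])
cond_after = np.linalg.cond(regularize(Sigma)[0])
print(f"Condition number before: {cond_before:.3e}, after: {cond_after:.3e}")
```

The shift lifts the smallest eigenvalue away from zero, which bounds the condition number and keeps gradient steps from being dominated by a near-singular direction.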
Problem 2: Constraint violations
Softmax doesn't enforce hard constraints. Some portfolios had weights summing to 1.02 or 0.98 due to numerical precision.
Solution: Post-process weights to enforce exact constraints:
```python
weights = torch.clamp(weights, min=0.0)  # Ensure non-negative first
weights = weights / weights.sum(dim=1, keepdim=True)  # Then renormalize so each row sums to 1
```
Problem 3: GPU memory limits
4,000 portfolios × 1,000 × 1,000 covariance entries × 4 bytes (float32) = 16GB per GPU. With gradients and optimizer state on top, this exceeded the A100's 40GB limit.
Solution: Process in smaller batches (1,250 portfolios per GPU instead of 4,000).
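The memory figure can be reproduced with a short estimate. This sketch assumes float32 storage (4 bytes per value) and decimal gigabytes, and it counts only the covariance tensors, not gradients or optimizer state. The helper names are illustrative.

```python
def covariance_bytes(n_portfolios: int, n_assets: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold one n_assets x n_assets covariance matrix per portfolio."""
    return n_portfolios * n_assets * n_assets * bytes_per_value


def to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9


full_batch_gb = to_gb(covariance_bytes(4_000, 1_000))
small_batch_gb = to_gb(covariance_bytes(1_250, 1_000))

print(f"4,000 portfolios: {full_batch_gb:.1f} GB")
print(f"1,250 portfolios: {small_batch_gb:.1f} GB")
```

Because the footprint grows with the square of the asset count, halving the batch size is a much cheaper lever than it first appears when the asset universe is large.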
GPU acceleration sounds great, but it's not always the right choice. Here's when to stick with CPU:
For small batches, GPU transfer overhead dominates. A single portfolio optimizes in 95ms on CPU versus 8ms on GPU plus 50ms of transfer, or 58ms in total. The 12x compute speedup shrinks to about 1.6x, too small to justify the added complexity.
Rule of thumb: GPUs become worthwhile at >500 portfolios.
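Plugging the single-portfolio figures quoted above into a one-line model shows how much of the compute speedup survives the transfer. The function name is illustrative, and the model ignores any other fixed costs such as CUDA initialization.

```python
def effective_speedup(cpu_ms: float, gpu_compute_ms: float, transfer_ms: float) -> float:
    """End-to-end speedup once host-to-device transfer time is included."""
    return cpu_ms / (gpu_compute_ms + transfer_ms)


# Single-portfolio figures from the article: 95ms CPU, 8ms GPU, 50ms transfer
compute_only = effective_speedup(95, 8, 0)
with_transfer = effective_speedup(95, 8, 50)

print(f"Compute only:  {compute_only:.1f}x")
print(f"With transfer: {with_transfer:.1f}x")
```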
GPUs excel at simple constraints (sum to 1, non-negative). Complex constraints (sector limits, tracking error, turnover limits) are hard to express as differentiable losses.
For complex constraints, use CPU-based quadratic programming solvers (MOSEK, Gurobi). They're slower but handle arbitrary constraints exactly.
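For illustration, one common way to approximate a sector limit inside a gradient-based optimizer is a penalty term added to the loss. The NumPy sketch below uses hypothetical weights, a hypothetical sector mask, and a 30% limit. It only discourages violations rather than ruling them out, which is the gap that exact solvers close.

```python
import numpy as np


def sector_penalty(w, sector_mask, limit):
    """Squared amount by which each portfolio's sector weight exceeds the limit."""
    sector_weight = w @ sector_mask                   # (n_portfolios,)
    excess = np.maximum(sector_weight - limit, 0.0)   # zero when within the limit
    return excess ** 2


# Hypothetical data: the first two of four assets belong to the sector
sector_mask = np.array([1.0, 1.0, 0.0, 0.0])
w = np.array([[0.10, 0.15, 0.40, 0.35],    # sector weight 0.25, within the limit
              [0.30, 0.30, 0.20, 0.20]])   # sector weight 0.60, over the limit

penalty = sector_penalty(w, sector_mask, limit=0.30)
print("Penalty per portfolio:", penalty)
```

In a loss function this term would be multiplied by a penalty coefficient and summed with the utility term. Choosing that coefficient is a tuning problem in its own right, and each additional constraint adds another one.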
Some regulators require exact solutions with proven optimality. Gradient descent provides approximate solutions with no optimality guarantees.
For regulated portfolios (pension funds, insurance), use exact solvers even if they're slower.
Building a GPU optimizer requires:
For a one-time optimization or infrequent rebalancing, the development cost isn't justified. Use off-the-shelf CPU solvers.
Here's a production-ready deployment architecture:
```python
from typing import List

import numpy as np
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# PortfolioOptimizer is the class defined earlier in this article;
# import it from wherever it lives in your project.

app = FastAPI()

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'


class OptimizationRequest(BaseModel):
    expected_returns: List[List[float]]  # (n_portfolios, n_assets)
    covariance: List[List[List[float]]]  # (n_portfolios, n_assets, n_assets)
    risk_aversion: float = 0.5


class OptimizationResponse(BaseModel):
    weights: List[List[float]]
    expected_return: List[float]
    portfolio_risk: List[float]


@app.post("/optimize", response_model=OptimizationResponse)
def optimize_portfolios(request: OptimizationRequest):
    """Optimize multiple portfolios on GPU."""
    # Declared with plain `def` so FastAPI runs this blocking GPU work in its
    # threadpool instead of stalling the event loop.
    try:
        # Convert to numpy
        mu = np.array(request.expected_returns)
        Sigma = np.array(request.covariance)

        # Validate dimensions
        if mu.ndim != 2:
            raise HTTPException(status_code=400, detail="expected_returns must be 2-D")
        n_portfolios, n_assets = mu.shape
        if Sigma.shape != (n_portfolios, n_assets, n_assets):
            raise HTTPException(status_code=400, detail="Dimension mismatch")

        # Build an optimizer sized to this request; a single global model
        # with fixed dimensions would fail on any other batch shape.
        model = PortfolioOptimizer(n_assets, n_portfolios, device=DEVICE)
        weights = model.optimize(mu, Sigma, request.risk_aversion)

        # Calculate metrics
        expected_return = (mu * weights).sum(axis=1).tolist()
        portfolio_risk = [
            float(np.sqrt(weights[i] @ Sigma[i] @ weights[i]))
            for i in range(n_portfolios)
        ]

        return OptimizationResponse(
            weights=weights.tolist(),
            expected_return=expected_return,
            portfolio_risk=portfolio_risk
        )

    except HTTPException:
        # Preserve deliberate 4xx responses instead of rewrapping them as 500
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Optimization failed: {str(e)}")


@app.get("/health")
async def health_check():
    """Check GPU availability."""
    return {
        "status": "healthy",
        "gpu_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count()
    }
```

Deployment:

```shell
uvicorn app:app --host 0.0.0.0 --port 8000
```

Let's do the math for different scales:
Additional benefits (hard to quantify):
GPU acceleration can transform portfolio optimization from an overnight batch job to a real-time service. But it's not a silver bullet. The decision depends on scale, constraints, and development resources.
Use GPUs when:
Use CPU when:
The wealth management firm's 40x speedup is real, but it required 6 months of development and ongoing GPU infrastructure costs. For them, the ROI was clear. For smaller firms, CPU-based solvers remain the practical choice.
As always in engineering: measure, analyze, and choose the right tool for your specific problem.
Technical Writer
NordVarg Team is a software engineer at NordVarg specializing in high-performance financial systems and type-safe programming.