GPU-Accelerated Portfolio Optimization: When 10 Hours Becomes 10 Seconds

In 2022, a wealth management firm was running daily portfolio rebalancing for 5,000 clients. Each client had a custom portfolio of 500-1,000 stocks, optimized for their risk tolerance and tax situation. The optimization ran overnight on a 64-core CPU server, taking 8-10 hours to complete. This meant portfolio managers couldn't react to market events—by the time optimization finished, markets had moved.

We migrated the system to GPUs. The same 5,000 portfolios now optimize in 12 minutes. Intraday rebalancing became possible. The firm could respond to market volatility in real-time, improving client returns by 40 basis points annually. For a $2B AUM firm, that's $8M in additional value per year.

But here's the catch: the GPU implementation took 6 months to build, cost $400K in engineering time, and requires $50K annually in GPU infrastructure. For a smaller firm with 500 clients, the ROI wouldn't justify the cost. GPU acceleration is powerful, but it's not always the right answer.

This article covers when GPU optimization makes sense, how to implement it with PyTorch, and—critically—when to stick with CPU-based solvers. We'll discuss the complete architecture, from mathematical formulation to production deployment, with real performance benchmarks and cost analysis.

The Portfolio Optimization Problem #

Portfolio optimization solves a fundamental question: given a universe of assets, how should you allocate capital to maximize return while controlling risk?

The classic formulation is Markowitz mean-variance optimization:

$$\min_w \quad \lambda \cdot w^T \Sigma w - \mu^T w$$

Subject to:

$\sum w_i = 1$ (fully invested)
$w_i \geq 0$ (long-only, no shorting)

Where:

$w$ = portfolio weights (what we're solving for)
$\Sigma$ = covariance matrix (risk)
$\mu$ = expected returns
$\lambda$ = risk aversion parameter
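
To make the objective concrete, here is a minimal sketch that evaluates $\lambda \cdot w^T \Sigma w - \mu^T w$ for a toy three-asset portfolio in plain Python. The function name `mean_variance_objective` and all input numbers are illustrative assumptions, not data from the case study.

```python
def mean_variance_objective(w, mu, sigma, lam):
    """Return lam * w^T Sigma w - mu^T w for one portfolio (plain lists)."""
    n = len(w)
    variance = sum(w[i] * sigma[i][j] * w[j] for i in range(n) for j in range(n))
    expected_return = sum(m * x for m, x in zip(mu, w))
    return lam * variance - expected_return


# Hypothetical inputs: three assets with uncorrelated returns
mu = [0.05, 0.08, 0.03]
sigma = [
    [0.04, 0.00, 0.00],
    [0.00, 0.09, 0.00],
    [0.00, 0.00, 0.01],
]
equal_weight = [1 / 3, 1 / 3, 1 / 3]
concentrated = [0.0, 1.0, 0.0]  # everything in the highest-return asset

obj_equal = mean_variance_objective(equal_weight, mu, sigma, lam=2.0)
obj_concentrated = mean_variance_objective(concentrated, mu, sigma, lam=2.0)
print(f"equal weight: {obj_equal:.4f}")
print(f"concentrated: {obj_concentrated:.4f}")
```

Lower is better, so with these numbers the diversified portfolio scores better than putting everything into the highest-return asset: the variance penalty outweighs the extra expected return.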

Why This is Computationally Expensive #

For a portfolio of $n$ assets:

Covariance matrix: $n \times n$ (for 1,000 assets, that's 1 million entries)
Matrix multiplication: $O(n^3)$ complexity for direct solvers
Constraint handling: Quadratic programming with inequality constraints

For 1,000 assets, a single portfolio optimization takes ~100ms on CPU. For 5,000 portfolios, that's 500 seconds (8+ minutes). Add transaction costs, tax considerations, and sector constraints, and you're at hours.

The GPU Advantage #

GPUs excel at parallel matrix operations. A single NVIDIA A100 has 6,912 CUDA cores that execute floating-point operations in parallel, so many independent matrix multiplications can run at once. A portfolio that takes about 100ms on a CPU costs roughly 2ms when amortized over a large GPU batch: a 50x speedup.

But the speedup only applies if you can parallelize the work. Optimizing a single portfolio doesn't benefit much from GPUs. Optimizing 1,000 portfolios simultaneously does.

CPU vs GPU: The Performance Reality #

Let's start with real benchmarks to set expectations.

Benchmark Setup #

Problem: Optimize 1,000 portfolios, each with 500 assets
CPU: AMD EPYC 7763 (64 cores, 2.45 GHz)
GPU: NVIDIA A100 (40GB, 6,912 CUDA cores)
Solver: CVXPY (CPU) vs PyTorch (GPU)

Results #

Metric	CPU (CVXPY)	GPU (PyTorch)	Speedup
Single portfolio	95ms	8ms	12x
100 portfolios	9.5s	0.4s	24x
1,000 portfolios	95s	2.1s	45x
10,000 portfolios	950s (16min)	18s	53x

Key insights:

Speedup increases with batch size (better GPU utilization)
For small batches (<10), GPU overhead dominates
For large batches (1,000 or more), the GPU is roughly 45-53x faster
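
The batch-size effect is easier to see as a per-portfolio cost. The short sketch below recomputes the speedup column and the amortized GPU time from the figures in the results table; the helper names are assumptions made for illustration.

```python
# (batch size, CPU seconds, GPU seconds) copied from the results table
results = [
    (1, 0.095, 0.008),
    (100, 9.5, 0.4),
    (1_000, 95.0, 2.1),
    (10_000, 950.0, 18.0),
]


def per_portfolio_ms(total_seconds, batch_size):
    """Amortized time per portfolio, in milliseconds."""
    return total_seconds * 1000 / batch_size


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_seconds / gpu_seconds


for batch, cpu_s, gpu_s in results:
    print(
        f"{batch:>6} portfolios: "
        f"CPU {per_portfolio_ms(cpu_s, batch):.1f} ms each, "
        f"GPU {per_portfolio_ms(gpu_s, batch):.1f} ms each, "
        f"speedup {speedup(cpu_s, gpu_s):.0f}x"
    )
```

The CPU cost stays at 95ms per portfolio regardless of batch size, while the amortized GPU cost falls from 8ms to under 2ms as the batch grows.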

The Memory Transfer Bottleneck #

The benchmark above assumes data is already on the GPU. In reality, you need to transfer data from CPU to GPU, which takes time:

Transfer 1,000 portfolios to GPU: ~50ms
Optimize on GPU: ~2,100ms
Transfer results back to CPU: ~20ms
Total: ~2,170ms

For small batches, transfer time dominates. For large batches, it's negligible. This is why GPU optimization only makes sense at scale.
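
As a quick check on how small the transfer cost is at this batch size, the sketch below computes the share of wall-clock time spent on transfers, using only the three timings listed above. The function name is an assumption for illustration.

```python
def transfer_share(to_gpu_ms, compute_ms, to_cpu_ms):
    """Fraction of total wall-clock time spent moving data between CPU and GPU."""
    total = to_gpu_ms + compute_ms + to_cpu_ms
    return (to_gpu_ms + to_cpu_ms) / total


# Timings for the 1,000-portfolio batch
share = transfer_share(to_gpu_ms=50, compute_ms=2100, to_cpu_ms=20)
print(f"Transfer share of total time: {share:.1%}")  # 3.2%
```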

Implementing Portfolio Optimization in PyTorch #

Traditional portfolio optimizers (CVXPY with a backend such as MOSEK or Gurobi) use quadratic programming solvers. These are exact, up to solver tolerance, but slow. In PyTorch we minimize the same objective with gradient descent, which is approximate but fast and GPU-friendly.

Formulating as a Differentiable Loss #

python

import numpy as np
import torch
import torch.nn as nn


class PortfolioOptimizer(nn.Module):
    """GPU-accelerated portfolio optimizer using PyTorch"""

    def __init__(self, n_assets, n_portfolios, device='cuda'):
        super().__init__()
        self.n_assets = n_assets
        self.n_portfolios = n_portfolios
        self.device = device

        # Learnable parameters, shape (n_portfolios, n_assets).
        # These are pre-softmax scores, not the weights themselves; equal
        # scores give an equal-weight starting portfolio after softmax.
        self.weights = nn.Parameter(
            torch.ones(n_portfolios, n_assets, device=device) / n_assets
        )

    def forward(self, mu, Sigma, risk_aversion):
        """
        Calculate portfolio loss (negative utility).

        Args:
            mu: Expected returns, shape (n_portfolios, n_assets)
            Sigma: Covariance matrices, shape (n_portfolios, n_assets, n_assets)
            risk_aversion: Risk aversion parameter, scalar

        Returns:
            loss: Portfolio loss (to minimize)
        """
        # Softmax maps the raw scores to weights that are non-negative and sum to 1
        w = torch.softmax(self.weights, dim=1)

        # Portfolio variance w^T Sigma w, batched:
        # (B, 1, n) @ (B, n, n) @ (B, n, 1) -> (B, 1, 1)
        variance = torch.bmm(
            torch.bmm(w.unsqueeze(1), Sigma),
            w.unsqueeze(2)
        ).reshape(-1)  # (n_portfolios,); reshape keeps the batch dim when n_portfolios == 1

        # Expected return: mu^T w
        expected_return = (mu * w).sum(dim=1)  # (n_portfolios,)

        # Utility is return - risk_aversion * variance; we minimize its negative
        loss = risk_aversion * variance - expected_return

        # Portfolios are independent, so averaging only rescales each gradient
        return loss.mean()

    def optimize(self, mu, Sigma, risk_aversion, n_iterations=1000, lr=0.01):
        """
        Optimize portfolio weights using Adam optimizer.

        Args:
            mu: Expected returns (CPU numpy array)
            Sigma: Covariance matrices (CPU numpy array)
            risk_aversion: Risk aversion parameter
            n_iterations: Number of optimization steps
            lr: Learning rate

        Returns:
            optimized_weights: Portfolio weights (CPU numpy array)
        """
        # Transfer data to GPU
        mu_gpu = torch.tensor(mu, dtype=torch.float32, device=self.device)
        Sigma_gpu = torch.tensor(Sigma, dtype=torch.float32, device=self.device)

        adam = torch.optim.Adam([self.weights], lr=lr)

        # Optimization loop
        for iteration in range(n_iterations):
            adam.zero_grad()
            loss = self(mu_gpu, Sigma_gpu, risk_aversion)
            loss.backward()
            adam.step()

            # Optional: print progress (loss.item() forces a GPU sync)
            if iteration % 100 == 0:
                print(f"Iteration {iteration}, Loss: {loss.item():.6f}")

        # Extract optimized weights
        with torch.no_grad():
            optimized_weights = torch.softmax(self.weights, dim=1)

        # Transfer back to CPU
        return optimized_weights.cpu().numpy()


# Usage example
n_assets = 500
n_portfolios = 1000

# Create optimizer
model = PortfolioOptimizer(n_assets, n_portfolios, device='cuda')

# Generate sample data (in practice, load from database)
rng = np.random.default_rng(0)
mu = rng.standard_normal((n_portfolios, n_assets), dtype=np.float32) * 0.001  # Expected returns
A = rng.standard_normal((n_portfolios, n_assets, n_assets), dtype=np.float32) * 0.01
# A @ A^T is symmetric positive semi-definite by construction; the small
# diagonal term makes it strictly positive definite
Sigma = A @ A.transpose(0, 2, 1) + 1e-4 * np.eye(n_assets, dtype=np.float32)

# Optimize
weights = model.optimize(mu, Sigma, risk_aversion=0.5, n_iterations=500)

print(f"Optimized {n_portfolios} portfolios")
print(f"Weights shape: {weights.shape}")
print(f"Weights sum (should be ~1.0): {weights.sum(axis=1).mean():.4f}")

Key techniques:

Softmax for constraints: Ensures weights sum to 1 and are non-negative
Batched matrix operations: torch.bmm handles multiple portfolios simultaneously
Adam optimizer: Adaptive learning rates handle ill-conditioned covariance matrices
GPU tensors: All operations on GPU for maximum speed

Case Study: Wealth Management Firm Optimization #

Let's revisit the wealth management firm from the introduction with specific details.

The Problem #

Firm profile:

5,000 clients
Average portfolio: 800 stocks
Daily rebalancing required
Custom constraints per client (tax-loss harvesting, ESG preferences, sector limits)

Old system (CPU):

Solver: CVXPY with MOSEK
Hardware: 64-core AMD EPYC server
Runtime: 8-10 hours overnight
Cost: $15K/month server rental

Limitations:

No intraday rebalancing (too slow)
Couldn't respond to market events
Optimization sometimes didn't finish before market open

The GPU Solution #

New system:

Solver: Custom PyTorch implementation
Hardware: 4x NVIDIA A100 GPUs
Runtime: 12 minutes for all 5,000 clients
Cost: $8K/month GPU cloud instances

Implementation details:

Batch size: 1,250 portfolios per GPU (4 GPUs = 5,000 total)
Iterations: 500 per portfolio
Constraints: Implemented as soft penalties in loss function
Validation: Compare against CVXPY on sample portfolios (error <0.1%)
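
As an illustration of what a soft penalty can look like, here is a minimal sketch for a single sector cap. The quadratic form, the `rho` scale, and the function name are assumptions chosen for this example; the firm's actual penalty terms are not specified above.

```python
def sector_penalty(weights, sector_members, cap, rho=100.0):
    """Quadratic penalty rho * max(0, sector_weight - cap)^2.

    Zero while the sector stays under its cap, growing smoothly once it is exceeded.
    """
    sector_weight = sum(weights[i] for i in sector_members)
    excess = max(0.0, sector_weight - cap)
    return rho * excess ** 2


weights = [0.30, 0.25, 0.25, 0.20]
tech = [0, 1]  # hypothetical: assets 0 and 1 belong to the same sector (55% combined)

print(sector_penalty(weights, tech, cap=0.60))  # under the cap: no penalty
print(sector_penalty(weights, tech, cap=0.40))  # over the cap by 0.15: positive penalty
```

In the PyTorch loss, the same expression would be written with tensor operations (for example `torch.relu` for the `max(0, ...)` part) so that it stays differentiable and batched. A soft penalty only discourages violations, so the final weights still need to be checked against the hard limit.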

Results:

Speed: 40-50x faster (8-10 hours → 12 minutes)
Cost: 47% cheaper ($15K → $8K/month)
Capability: Enabled intraday rebalancing
Client impact: +40 bps annual returns (better timing)

What Went Wrong (and How We Fixed It)#

Problem 1: Numerical instability

Some covariance matrices were ill-conditioned, causing gradient descent to diverge. Weights would explode to infinity or collapse to zero.

Solution: Add regularization to covariance matrix:

python

# Add a small ridge to the diagonal; match Sigma's device and dtype
Sigma_regularized = Sigma + 1e-4 * torch.eye(n_assets, device=Sigma.device, dtype=Sigma.dtype)

Problem 2: Constraint violations

Softmax doesn't enforce hard constraints. Some portfolios had weights summing to 1.02 or 0.98 due to numerical precision.

Solution: Post-process weights to enforce exact constraints:

python

weights = torch.clamp(weights, min=0.0)  # Remove any negative weights first
weights = weights / weights.sum(dim=1, keepdim=True)  # Then renormalize so each row sums to 1
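
The order of the two post-processing steps matters: clamping after renormalizing can break the sum-to-one constraint again. The plain-Python sketch below shows this on a hypothetical weight vector with one small negative entry; the function names are made up for illustration, and it assumes at least one weight is positive.

```python
def clamp_then_renormalize(weights):
    """Zero out negatives, then rescale so the weights sum to 1."""
    clamped = [max(0.0, w) for w in weights]
    total = sum(clamped)  # assumes at least one positive weight
    return [w / total for w in clamped]


def renormalize_then_clamp(weights):
    """Rescale first, then zero out negatives (the sum can drift above 1)."""
    total = sum(weights)
    return [max(0.0, w / total) for w in weights]


raw = [0.6, 0.5, -0.1]  # hypothetical raw weights; they already sum to about 1

good = clamp_then_renormalize(raw)
bad = renormalize_then_clamp(raw)
print(f"clamp then renormalize: sum = {sum(good):.4f}")
print(f"renormalize then clamp: sum = {sum(bad):.4f}")
```

Clamping removes the negative weight without compensating for it, so the renormalization has to come last.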

Problem 3: GPU memory limits

4,000 portfolios × 1,000 assets × 1,000 assets × 4 bytes (float32) = 16GB of covariance data per GPU. Once gradients and optimizer state were included, this exceeded the A100's 40GB limit.

Solution: Process in smaller batches (1,250 portfolios per GPU instead of 4,000).
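
The memory arithmetic is easy to reproduce before launching a job. This small sketch estimates only the dense covariance storage, assuming float32 values and decimal gigabytes; the function name is illustrative, and activations, gradients, and optimizer state come on top.

```python
def covariance_gb(n_portfolios, n_assets, bytes_per_value=4):
    """Memory for a batch of dense covariance matrices, in GB (10^9 bytes)."""
    return n_portfolios * n_assets * n_assets * bytes_per_value / 1e9


print(covariance_gb(4_000, 1_000))  # 16.0
print(covariance_gb(1_250, 1_000))  # 5.0
```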

When GPU Optimization Isn't Worth It #

GPU acceleration sounds great, but it's not always the right choice. Here's when to stick with CPU:

Small Scale (<100 portfolios)#

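For small batches, fixed GPU overhead eats most of the gain. A single portfolio optimizes in 95ms on CPU versus 8ms on GPU, but the GPU figure excludes data transfer. Adding the ~50ms transfer time measured earlier (an upper bound, since that figure is for a 1,000-portfolio batch) gives about 58ms: less than a 2x speedup, compared with 45x at 1,000 portfolios. That margin does not justify the engineering and infrastructure cost.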

Rule of thumb: GPUs become worthwhile at >500 portfolios.

Complex Constraints #

GPUs excel at simple constraints (sum to 1, non-negative). Complex constraints (sector limits, tracking error, turnover limits) are hard to express as differentiable losses.

For complex constraints, use CPU-based quadratic programming solvers (MOSEK, Gurobi). They're slower but handle arbitrary constraints exactly.

Regulatory Requirements #

Some regulators require exact solutions with proven optimality. Gradient descent provides approximate solutions with no optimality guarantees.

For regulated portfolios (pension funds, insurance), use exact solvers even if they're slower.

Development Cost #

Building a GPU optimizer requires:

PyTorch expertise
GPU infrastructure
Extensive testing and validation
Ongoing maintenance

For a one-time optimization or infrequent rebalancing, the development cost isn't justified. Use off-the-shelf CPU solvers.

Production Deployment: FastAPI + GPU Serving #

Here's a production-ready deployment architecture:

python

from typing import List

import numpy as np
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# PortfolioOptimizer is the class defined earlier in this article

app = FastAPI()


class OptimizationRequest(BaseModel):
    expected_returns: List[List[float]]  # (n_portfolios, n_assets)
    covariance: List[List[List[float]]]  # (n_portfolios, n_assets, n_assets)
    risk_aversion: float = 0.5


class OptimizationResponse(BaseModel):
    weights: List[List[float]]
    expected_return: List[float]
    portfolio_risk: List[float]


# Plain `def` rather than `async def`: FastAPI runs it in a worker thread,
# so the blocking GPU loop does not stall the event loop
@app.post("/optimize", response_model=OptimizationResponse)
def optimize_portfolios(request: OptimizationRequest):
    """Optimize multiple portfolios on GPU"""
    # Convert to numpy; ragged input raises ValueError
    try:
        mu = np.asarray(request.expected_returns, dtype=np.float64)
        Sigma = np.asarray(request.covariance, dtype=np.float64)
    except ValueError:
        raise HTTPException(status_code=400, detail="Inputs must be rectangular arrays")

    # Validate dimensions before touching the GPU
    if mu.ndim != 2 or Sigma.shape != (*mu.shape, mu.shape[1]):
        raise HTTPException(status_code=400, detail="Dimension mismatch")
    n_portfolios, n_assets = mu.shape

    try:
        # Build an optimizer sized to this request; a single shared instance
        # would have a fixed shape and carry weights over between requests
        model = PortfolioOptimizer(n_assets, n_portfolios, device='cuda')
        weights = model.optimize(mu, Sigma, request.risk_aversion)
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Optimization failed: {e}")

    # Calculate metrics on CPU
    expected_return = (mu * weights).sum(axis=1)
    variance = np.einsum('bi,bij,bj->b', weights, Sigma, weights)

    return OptimizationResponse(
        weights=weights.tolist(),
        expected_return=expected_return.tolist(),
        portfolio_risk=np.sqrt(variance).tolist(),
    )


@app.get("/health")
def health_check():
    """Check GPU availability"""
    return {
        "status": "healthy",
        "gpu_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }

Deployment:

Run with uvicorn app:app --host 0.0.0.0 --port 8000
Deploy on GPU-enabled Kubernetes cluster
Use horizontal pod autoscaling based on request queue depth
Monitor GPU utilization with Prometheus

Cost Analysis: Is GPU Worth It?#

Let's do the math for different scales:

Small Firm (500 clients)#

CPU cost: $2K/month (8-core server)
GPU cost: $3K/month (1x A100) + $50K development
Break-even: never on infrastructure cost alone (the GPU setup costs $1K/month more than the CPU server)
Verdict: Not worth it (unless growth expected)

Medium Firm (2,000 clients)#

CPU cost: $8K/month (32-core server)
GPU cost: $6K/month (2x A100) + $50K development
Break-even: 25 months ($50K development / $2K monthly savings)
Verdict: Marginal (depends on intraday rebalancing value)

Large Firm (5,000+ clients)#

CPU cost: $15K/month (64-core server)
GPU cost: $8K/month (4x A100) + $50K development
Break-even: 7 months
Verdict: Definitely worth it
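
The break-even arithmetic can be checked with a few lines of Python. This sketch counts only the infrastructure figures listed for the three scenarios and ignores the value of faster rebalancing; the function name is an assumption for illustration.

```python
def break_even_months(development_cost, cpu_monthly, gpu_monthly):
    """Months until monthly savings repay development; None if the GPU is not cheaper."""
    monthly_savings = cpu_monthly - gpu_monthly
    if monthly_savings <= 0:
        return None
    return development_cost / monthly_savings


# (development cost, CPU $/month, GPU $/month) from the three scenarios
scenarios = {
    "small": (50_000, 2_000, 3_000),
    "medium": (50_000, 8_000, 6_000),
    "large": (50_000, 15_000, 8_000),
}

for name, args in scenarios.items():
    months = break_even_months(*args)
    label = "no break-even" if months is None else f"{months:.1f} months"
    print(f"{name}: {label}")
```

On infrastructure cost alone, only the large firm recovers the development cost quickly; the small firm never does, because its GPU bill is higher than its CPU bill.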

Additional benefits (hard to quantify):

Intraday rebalancing capability
Faster response to market events
Ability to run more scenarios (stress tests, what-if analysis)

Conclusion: The Right Tool for the Right Job #

GPU acceleration can transform portfolio optimization from an overnight batch job to a real-time service. But it's not a silver bullet. The decision depends on scale, constraints, and development resources.

Use GPUs when:

Optimizing >500 portfolios regularly
Simple constraints (sum to 1, non-negative, box constraints)
Speed matters (intraday rebalancing, real-time scenarios)
You have GPU expertise and infrastructure

Use CPU when:

Optimizing <100 portfolios
Complex constraints (sector limits, tracking error, turnover)
Regulatory requirements for exact solutions
One-time or infrequent optimization

The wealth management firm's 40x speedup is real, but it required 6 months of development and ongoing GPU infrastructure costs. For them, the ROI was clear. For smaller firms, CPU-based solvers remain the practical choice.

As always in engineering: measure, analyze, and choose the right tool for your specific problem.
