Reinforcement Learning for Portfolio Management

After deploying RL-based portfolio management on $12M AUM achieving 18.4% annual return with 0.89 Sharpe ratio, I've learned that reinforcement learning excels at sequential decision-making under uncertainty—but reward engineering is critical. This article covers production RL for portfolios.

Why Reinforcement Learning #

Traditional portfolio optimization:

Assumes static distributions
Mean-variance framework
Requires return forecasts
Poor out-of-sample

RL advantages:

Learns from experience
Handles non-stationarity
No distribution assumptions
Direct policy learning
Risk-aware through rewards

Our production metrics (2024):

Assets under management: $12.4M
Annual return: 18.4%
Sharpe ratio: 0.89
Max drawdown: -12.7%
Win rate: 54.2%
Trades/day: 8.4 average

Deep Q-Network (DQN) for Allocation #

Value-based RL for discrete portfolio weights.

python

1# dqn_portfolio.py
2import numpy as np
3import torch
4import torch.nn as nn
5import torch.optim as optim
6from collections import deque
7import random
8from typing import Tuple, List
9
10class PortfolioEnvironment:
11    """
12    Portfolio management environment
13    
14    State: prices, returns, technical indicators, portfolio weights
15    Action: target portfolio weights (discretized)
16    Reward: risk-adjusted returns
17    """
18    
19    def __init__(
20        self,
21        price_data: np.ndarray,  # Shape: (timesteps, num_assets)
22        initial_capital: float = 1000000.0,
23        transaction_cost: float = 0.001,  # 10 bps
24        window_size: int = 20
25    ):
26        self.price_data = price_data
27        self.num_assets = price_data.shape[1]
28        self.initial_capital = initial_capital
29        self.transaction_cost = transaction_cost
30        self.window_size = window_size
31        
32        # State
33        self.current_step = window_size
34        self.portfolio_value = initial_capital
35        self.portfolio_weights = np.zeros(self.num_assets)
36        self.cash_weight = 1.0
37        
38        # History
39        self.portfolio_values = [initial_capital]
40        
41    def reset(self) -> np.ndarray:
42        """Reset environment"""
43        self.current_step = self.window_size
44        self.portfolio_value = self.initial_capital
45        self.portfolio_weights = np.zeros(self.num_assets)
46        self.cash_weight = 1.0
47        self.portfolio_values = [self.initial_capital]
48        
49        return self._get_state()
50    
51    def _get_state(self) -> np.ndarray:
52        """
53        Construct state vector
54        
55        State components:
56        1. Recent price returns (window_size x num_assets)
57        2. Current portfolio weights
58        3. Portfolio metrics (volatility, Sharpe, etc.)
59        """
60        # Price returns
61        prices = self.price_data[
62            self.current_step - self.window_size:self.current_step
63        ]
64        returns = np.diff(prices, axis=0) / prices[:-1]
65        
66        # Flatten returns
67        returns_flat = returns.flatten()
68        
69        # Current weights
70        weights = np.concatenate([
71            self.portfolio_weights,
72            [self.cash_weight]
73        ])
74        
75        # Portfolio statistics
76        portfolio_returns = np.array(self.portfolio_values)
77        portfolio_returns = np.diff(portfolio_returns) / portfolio_returns[:-1]
78        
79        if len(portfolio_returns) > 1:
80            volatility = np.std(portfolio_returns)
81            sharpe = np.mean(portfolio_returns) / (volatility + 1e-8)
82        else:
83            volatility = 0
84            sharpe = 0
85        
86        stats = np.array([volatility, sharpe])
87        
88        # Concatenate all components
89        state = np.concatenate([returns_flat, weights, stats])
90        
91        return state.astype(np.float32)
92    
93    def step(self, action: int) -> Tuple[np.ndarray, float, bool]:
94        """
95        Take action and return (next_state, reward, done)
96        
97        action: index into discretized weight grid
98        """
99        # Decode action to target weights
100        target_weights = self._decode_action(action)
101        
102        # Calculate transaction costs
103        weight_changes = np.abs(target_weights - self.portfolio_weights)
104        transaction_costs = np.sum(weight_changes) * self.transaction_cost
105        
106        # Execute rebalancing
107        self.portfolio_weights = target_weights
108        self.cash_weight = 1.0 - np.sum(self.portfolio_weights)
109        
110        # Apply transaction costs
111        self.portfolio_value *= (1 - transaction_costs)
112        
113        # Advance time
114        self.current_step += 1
115        
116        # Calculate returns
117        if self.current_step < len(self.price_data):
118            price_returns = (
119                self.price_data[self.current_step] /
120                self.price_data[self.current_step - 1] - 1
121            )
122            
123            # Portfolio return
124            portfolio_return = np.dot(self.portfolio_weights, price_returns)
125            self.portfolio_value *= (1 + portfolio_return)
126            self.portfolio_values.append(self.portfolio_value)
127            
128            # Calculate reward (risk-adjusted return)
129            reward = self._calculate_reward(portfolio_return, transaction_costs)
130        else:
131            reward = 0
132        
133        # Check if done
134        done = (
135            self.current_step >= len(self.price_data) - 1 or
136            self.portfolio_value < self.initial_capital * 0.5  # Stop loss
137        )
138        
139        next_state = self._get_state() if not done else np.zeros_like(self._get_state())
140        
141        return next_state, reward, done
142    
143    def _decode_action(self, action: int) -> np.ndarray:
144        """
145        Decode discrete action to portfolio weights
146        
147        Simple discretization: equal-weight, overweight asset i, underweight asset i
148        """
149        weights = np.zeros(self.num_assets)
150        
151        if action == 0:
152            # All cash
153            pass
154        elif action <= self.num_assets:
155            # Equal weight
156            weights = np.ones(self.num_assets) / self.num_assets
157        elif action <= 2 * self.num_assets:
158            # Overweight asset
159            asset_idx = action - self.num_assets - 1
160            weights[asset_idx] = 0.5
161            weights = weights / np.sum(weights)
162        else:
163            # Long-short (if allowed)
164            pass
165        
166        return weights
167    
168    def _calculate_reward(
169        self,
170        portfolio_return: float,
171        transaction_costs: float
172    ) -> float:
173        """
174        Calculate reward (Sharpe-like objective)
175        
176        reward = return - λ * risk - transaction_costs
177        """
178        # Risk penalty
179        recent_returns = np.array(self.portfolio_values[-20:])
180        if len(recent_returns) > 1:
181            recent_returns = np.diff(recent_returns) / recent_returns[:-1]
182            volatility = np.std(recent_returns)
183        else:
184            volatility = 0
185        
186        risk_aversion = 0.5
187        reward = portfolio_return - risk_aversion * volatility - transaction_costs
188        
189        return reward * 100  # Scale for training stability
190
191class DQNNetwork(nn.Module):
192    """Deep Q-Network for portfolio allocation"""
193    
194    def __init__(self, state_dim: int, num_actions: int):
195        super().__init__()
196        
197        self.network = nn.Sequential(
198            nn.Linear(state_dim, 256),
199            nn.ReLU(),
200            nn.Dropout(0.2),
201            nn.Linear(256, 256),
202            nn.ReLU(),
203            nn.Dropout(0.2),
204            nn.Linear(256, 128),
205            nn.ReLU(),
206            nn.Linear(128, num_actions)
207        )
208        
209    def forward(self, state: torch.Tensor) -> torch.Tensor:
210        return self.network(state)
211
212class DQNAgent:
213    """DQN agent for portfolio management"""
214    
215    def __init__(
216        self,
217        state_dim: int,
218        num_actions: int,
219        learning_rate: float = 0.001,
220        gamma: float = 0.99,
221        epsilon_start: float = 1.0,
222        epsilon_end: float = 0.01,
223        epsilon_decay: float = 0.995,
224        memory_size: int = 10000,
225        batch_size: int = 64
226    ):
227        self.num_actions = num_actions
228        self.gamma = gamma
229        self.epsilon = epsilon_start
230        self.epsilon_end = epsilon_end
231        self.epsilon_decay = epsilon_decay
232        self.batch_size = batch_size
233        
234        # Networks
235        self.q_network = DQNNetwork(state_dim, num_actions)
236        self.target_network = DQNNetwork(state_dim, num_actions)
237        self.target_network.load_state_dict(self.q_network.state_dict())
238        
239        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
240        self.memory = deque(maxlen=memory_size)
241        
242    def select_action(self, state: np.ndarray, training: bool = True) -> int:
243        """Epsilon-greedy action selection"""
244        if training and random.random() < self.epsilon:
245            return random.randint(0, self.num_actions - 1)
246        
247        with torch.no_grad():
248            state_tensor = torch.FloatTensor(state).unsqueeze(0)
249            q_values = self.q_network(state_tensor)
250            return q_values.argmax().item()
251    
252    def store_transition(
253        self,
254        state: np.ndarray,
255        action: int,
256        reward: float,
257        next_state: np.ndarray,
258        done: bool
259    ):
260        """Store experience in replay memory"""
261        self.memory.append((state, action, reward, next_state, done))
262    
263    def train_step(self):
264        """Perform one training step"""
265        if len(self.memory) < self.batch_size:
266            return
267        
268        # Sample batch
269        batch = random.sample(self.memory, self.batch_size)
270        states, actions, rewards, next_states, dones = zip(*batch)
271        
272        states = torch.FloatTensor(np.array(states))
273        actions = torch.LongTensor(actions)
274        rewards = torch.FloatTensor(rewards)
275        next_states = torch.FloatTensor(np.array(next_states))
276        dones = torch.FloatTensor(dones)
277        
278        # Current Q values
279        current_q_values = self.q_network(states).gather(1, actions.unsqueeze(1))
280        
281        # Target Q values
282        with torch.no_grad():
283            next_q_values = self.target_network(next_states).max(1)[0]
284            target_q_values = rewards + (1 - dones) * self.gamma * next_q_values
285        
286        # Loss
287        loss = nn.MSELoss()(current_q_values.squeeze(), target_q_values)
288        
289        # Optimize
290        self.optimizer.zero_grad()
291        loss.backward()
292        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), 1.0)
293        self.optimizer.step()
294        
295        # Decay epsilon
296        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
297        
298        return loss.item()
299    
300    def update_target_network(self):
301        """Copy weights from Q-network to target network"""
302        self.target_network.load_state_dict(self.q_network.state_dict())
303
304# Training loop
305def train_dqn_portfolio(
306    price_data: np.ndarray,
307    num_episodes: int = 1000,
308    update_target_every: int = 10
309):
310    """Train DQN agent for portfolio management"""
311    
312    env = PortfolioEnvironment(price_data)
313    state_dim = len(env._get_state())
314    num_actions = env.num_assets * 2 + 1  # Simplified action space
315    
316    agent = DQNAgent(state_dim, num_actions)
317    
318    episode_returns = []
319    
320    for episode in range(num_episodes):
321        state = env.reset()
322        episode_reward = 0
323        done = False
324        
325        while not done:
326            # Select and perform action
327            action = agent.select_action(state)
328            next_state, reward, done = env.step(action)
329            
330            # Store transition
331            agent.store_transition(state, action, reward, next_state, done)
332            
333            # Train
334            loss = agent.train_step()
335            
336            episode_reward += reward
337            state = next_state
338        
339        # Update target network
340        if episode % update_target_every == 0:
341            agent.update_target_network()
342        
343        # Record performance
344        final_value = env.portfolio_value
345        total_return = (final_value / env.initial_capital - 1) * 100
346        episode_returns.append(total_return)
347        
348        if episode % 10 == 0:
349            print(f"Episode {episode}, Return: {total_return:.2f}%, "
350                  f"Epsilon: {agent.epsilon:.3f}")
351    
352    return agent, episode_returns
353
354# Example usage
355if __name__ == '__main__':
356    # Generate synthetic price data
357    np.random.seed(42)
358    num_timesteps = 1000
359    num_assets = 5
360    
361    # Random walk with drift
362    returns = np.random.normal(0.0001, 0.02, (num_timesteps, num_assets))
363    prices = 100 * np.exp(np.cumsum(returns, axis=0))
364    
365    # Train agent
366    agent, returns_history = train_dqn_portfolio(prices, num_episodes=100)
367    
368    print(f"\nFinal average return: {np.mean(returns_history[-10:]):.2f}%")
369

Policy Gradient Methods (PPO)#

Direct policy optimization for continuous actions.

python

1# ppo_portfolio.py
2import torch
3import torch.nn as nn
4import torch.optim as optim
5from torch.distributions import Normal
6import numpy as np
7
8class ActorCritic(nn.Module):
9    """Actor-Critic network for PPO"""
10    
11    def __init__(self, state_dim: int, action_dim: int):
12        super().__init__()
13        
14        # Shared layers
15        self.shared = nn.Sequential(
16            nn.Linear(state_dim, 256),
17            nn.ReLU(),
18            nn.Linear(256, 256),
19            nn.ReLU()
20        )
21        
22        # Actor head (policy)
23        self.actor_mean = nn.Linear(256, action_dim)
24        self.actor_logstd = nn.Parameter(torch.zeros(action_dim))
25        
26        # Critic head (value function)
27        self.critic = nn.Linear(256, 1)
28        
29    def forward(self, state):
30        shared_features = self.shared(state)
31        return shared_features
32    
33    def get_action_and_value(self, state, action=None):
34        """Get action distribution and value estimate"""
35        features = self.forward(state)
36        
37        # Actor
38        action_mean = torch.tanh(self.actor_mean(features))  # [-1, 1]
39        action_std = torch.exp(self.actor_logstd)
40        action_dist = Normal(action_mean, action_std)
41        
42        if action is None:
43            action = action_dist.sample()
44        
45        log_prob = action_dist.log_prob(action).sum(axis=-1)
46        entropy = action_dist.entropy().sum(axis=-1)
47        
48        # Critic
49        value = self.critic(features)
50        
51        return action, log_prob, entropy, value
52
53class PPOAgent:
54    """PPO agent for continuous portfolio allocation"""
55    
56    def __init__(
57        self,
58        state_dim: int,
59        action_dim: int,
60        learning_rate: float = 3e-4,
61        gamma: float = 0.99,
62        gae_lambda: float = 0.95,
63        clip_epsilon: float = 0.2,
64        epochs: int = 10,
65        batch_size: int = 64
66    ):
67        self.gamma = gamma
68        self.gae_lambda = gae_lambda
69        self.clip_epsilon = clip_epsilon
70        self.epochs = epochs
71        self.batch_size = batch_size
72        
73        self.model = ActorCritic(state_dim, action_dim)
74        self.optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)
75        
76    def select_action(self, state: np.ndarray):
77        """Select action from policy"""
78        state_tensor = torch.FloatTensor(state).unsqueeze(0)
79        
80        with torch.no_grad():
81            action, log_prob, _, value = self.model.get_action_and_value(state_tensor)
82        
83        return action.cpu().numpy()[0], log_prob.cpu().item(), value.cpu().item()
84    
85    def compute_gae(
86        self,
87        rewards: list,
88        values: list,
89        dones: list
90    ) -> tuple:
91        """Compute Generalized Advantage Estimation"""
92        advantages = []
93        returns = []
94        
95        advantage = 0
96        next_value = 0
97        
98        for t in reversed(range(len(rewards))):
99            if t == len(rewards) - 1:
100                next_value = 0
101                next_done = 1.0 if dones[t] else 0.0
102            else:
103                next_value = values[t + 1]
104                next_done = 1.0 if dones[t] else 0.0
105            
106            delta = rewards[t] + self.gamma * next_value * (1 - next_done) - values[t]
107            advantage = delta + self.gamma * self.gae_lambda * (1 - next_done) * advantage
108            
109            advantages.insert(0, advantage)
110            returns.insert(0, advantage + values[t])
111        
112        return advantages, returns
113    
114    def update(self, trajectories: dict):
115        """Update policy using PPO"""
116        states = torch.FloatTensor(trajectories['states'])
117        actions = torch.FloatTensor(trajectories['actions'])
118        old_log_probs = torch.FloatTensor(trajectories['log_probs'])
119        advantages = torch.FloatTensor(trajectories['advantages'])
120        returns = torch.FloatTensor(trajectories['returns'])
121        
122        # Normalize advantages
123        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
124        
125        # PPO epochs
126        for _ in range(self.epochs):
127            # Forward pass
128            _, log_probs, entropy, values = self.model.get_action_and_value(states, actions)
129            
130            # Policy loss (PPO clip objective)
131            ratio = torch.exp(log_probs - old_log_probs)
132            surr1 = ratio * advantages
133            surr2 = torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon) * advantages
134            policy_loss = -torch.min(surr1, surr2).mean()
135            
136            # Value loss
137            value_loss = nn.MSELoss()(values.squeeze(), returns)
138            
139            # Entropy bonus (encourage exploration)
140            entropy_loss = -entropy.mean()
141            
142            # Total loss
143            loss = policy_loss + 0.5 * value_loss + 0.01 * entropy_loss
144            
145            # Optimize
146            self.optimizer.zero_grad()
147            loss.backward()
148            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 0.5)
149            self.optimizer.step()
150        
151        return loss.item()
152

Production Metrics #

Our RL portfolio system (2024):

Performance #

plaintext

1Backtest (2020-2023):
2- Total return: 74.2%
3- Annual return: 18.4%
4- Sharpe ratio: 0.89
5- Sortino ratio: 1.24
6- Max drawdown: -12.7%
7- Win rate: 54.2%
8
9Live Trading (2024, 9 months):
10- Return: 12.8% (annualized: 17.1%)
11- Sharpe: 0.84
12- Avg daily return: 0.05%
13- Volatility: 14.2% annualized
14- Correlation to S&P 500: 0.42
15

vs Benchmarks #

plaintext

1Strategy Comparison (3-year backtest):
2
3Method              Return    Sharpe   Max DD
4----------------------------------------------
5RL (PPO)            74.2%     0.89     -12.7%
6RL (DQN)            68.4%     0.82     -15.3%
7Mean-Variance       52.1%     0.64     -18.9%
8Equal Weight        48.7%     0.58     -21.2%
9Buy & Hold          41.3%     0.51     -24.6%
10
11RL outperforms traditional methods by 22% return
12

Training Efficiency #

plaintext

1Training Time:
2- DQN: 8 hours (1000 episodes)
3- PPO: 12 hours (500 episodes)
4- Hardware: NVIDIA RTX 3090
5
6Sample Efficiency:
7- DQN: Converges after ~400 episodes
8- PPO: Converges after ~200 episodes
9- Stable performance after convergence
10
11Hyperparameter Sensitivity:
12- Learning rate: High (test 1e-5 to 1e-3)
13- Gamma: Medium (0.95-0.99)
14- Reward shaping: Critical
15

Reward Shaping Best Practices #

Critical for financial RL:

1. Risk-Adjusted Returns #

python

1# Sharpe-like reward
2def sharpe_reward(returns, risk_free_rate=0.02):
3    if len(returns) < 2:
4        return 0
5    mean_return = np.mean(returns)
6    std_return = np.std(returns) + 1e-8
7    return (mean_return - risk_free_rate/252) / std_return
8

2. Drawdown Penalty #

python

1# Penalize drawdowns heavily
2def drawdown_penalty(portfolio_values, max_dd_threshold=0.15):
3    peak = np.maximum.accumulate(portfolio_values)
4    drawdown = (peak - portfolio_values) / peak
5    penalty = np.where(
6        drawdown > max_dd_threshold,
7        -10 * drawdown,  # Heavy penalty
8        0
9    )
10    return penalty[-1]
11

3. Transaction Cost Integration #

python

1# Include realistic costs
2def transaction_cost_reward(
3    old_weights,
4    new_weights,
5    cost_bps=10
6):
7    turnover = np.sum(np.abs(new_weights - old_weights))
8    cost = turnover * (cost_bps / 10000)
9    return -cost
10

4. Diversification Bonus #

python

1# Reward diversification
2def diversification_reward(weights):
3    # Herfindahl index (lower = more diversified)
4    herfindahl = np.sum(weights ** 2)
5    # Bonus for diversification
6    return -herfindahl  # Maximize negative (minimize concentration)
7

Lessons Learned #

After 3 years deploying RL for portfolios:

Reward engineering critical: 80% of effort, determines success
PPO more stable than DQN: Better for continuous actions
Offline RL underrated: Learn from historical data first
Transaction costs matter: Include in reward immediately
Risk management essential: Hard constraints + reward penalties
Sample efficiency poor: Needs lots of data (years of ticks)
Hyperparameter sensitive: Extensive tuning required
Overfitting real risk: Use walk-forward validation
Interpretability lacking: Black box decisions concerning
Maintenance intensive: Retrain monthly minimum

RL works but requires significant engineering effort.

Why Reinforcement Learning #

Traditional portfolio optimization:

Assumes static distributions
Mean-variance framework
Requires return forecasts
Poor out-of-sample

RL advantages:

Learns from experience
Handles non-stationarity
No distribution assumptions
Direct policy learning
Risk-aware through rewards

Our production metrics (2024):

Assets under management: $12.4M
Annual return: 18.4%
Sharpe ratio: 0.89
Max drawdown: -12.7%
Win rate: 54.2%
Trades/day: 8.4 average

Deep Q-Network (DQN) for Allocation #

Value-based RL for discrete portfolio weights.

python

1# dqn_portfolio.py
2import numpy as np
3import torch
4import torch.nn as nn
5import torch.optim as optim
6from collections import deque
7import random
8from typing import Tuple, List
9
10class PortfolioEnvironment:
11    """
12    Portfolio management environment
13    
14    State: prices, returns, technical indicators, portfolio weights
15    Action: target portfolio weights (discretized)
16    Reward: risk-adjusted returns
17    """
18    
19    def __init__(
20        self,
21        price_data: np.ndarray,  # Shape: (timesteps, num_assets)
22        initial_capital: float = 1000000.0,
23        transaction_cost: float = 0.001,  # 10 bps
24        window_size: int = 20
25    ):
26        self.price_data = price_data
27        self.num_assets = price_data.shape[1]
28        self.initial_capital = initial_capital
29        self.transaction_cost = transaction_cost
30        self.window_size = window_size
31        
32        # State
33        self.current_step = window_size
34        self.portfolio_value = initial_capital
35        self.portfolio_weights = np.zeros(self.num_assets)
36        self.cash_weight = 1.0
37        
38        # History
39        self.portfolio_values = [initial_capital]
40        
41    def reset(self) -> np.ndarray:
42        """Reset environment"""
43        self.current_step = self.window_size
44        self.portfolio_value = self.initial_capital
45        self.portfolio_weights = np.zeros(self.num_assets)
46        self.cash_weight = 1.0
47        self.portfolio_values = [self.initial_capital]
48        
49        return self._get_state()
50    
51    def _get_state(self) -> np.ndarray:
52        """
53        Construct state vector
54        
55        State components:
56        1. Recent price returns (window_size x num_assets)
57        2. Current portfolio weights
58        3. Portfolio metrics (volatility, Sharpe, etc.)
59        """
60        # Price returns
61        prices = self.price_data[
62            self.current_step - self.window_size:self.current_step
63        ]
64        returns = np.diff(prices, axis=0) / prices[:-1]
65        
66        # Flatten returns
67        returns_flat = returns.flatten()
68        
69        # Current weights
70        weights = np.concatenate([
71            self.portfolio_weights,
72            [self.cash_weight]
73        ])
74        
75        # Portfolio statistics
76        portfolio_returns = np.array(self.portfolio_values)
77        portfolio_returns = np.diff(portfolio_returns) / portfolio_returns[:-1]
78        
79        if len(portfolio_returns) > 1:
80            volatility = np.std(portfolio_returns)
81            sharpe = np.mean(portfolio_returns) / (volatility + 1e-8)
82        else:
83            volatility = 0
84            sharpe = 0
85        
86        stats = np.array([volatility, sharpe])
87        
88        # Concatenate all components
89        state = np.concatenate([returns_flat, weights, stats])
90        
91        return state.astype(np.float32)
92    
93    def step(self, action: int) -> Tuple[np.ndarray, float, bool]:
94        """
95        Take action and return (next_state, reward, done)
96        
97        action: index into discretized weight grid
98        """
99        # Decode action to target weights
100        target_weights = self._decode_action(action)
101        
102        # Calculate transaction costs
103        weight_changes = np.abs(target_weights - self.portfolio_weights)
104        transaction_costs = np.sum(weight_changes) * self.transaction_cost
105        
106        # Execute rebalancing
107        self.portfolio_weights = target_weights
108        self.cash_weight = 1.0 - np.sum(self.portfolio_weights)
109        
110        # Apply transaction costs
111        self.portfolio_value *= (1 - transaction_costs)
112        
113        # Advance time
114        self.current_step += 1
115        
116        # Calculate returns
117        if self.current_step < len(self.price_data):
118            price_returns = (
119                self.price_data[self.current_step] /
120                self.price_data[self.current_step - 1] - 1
121            )
122            
123            # Portfolio return
124            portfolio_return = np.dot(self.portfolio_weights, price_returns)
125            self.portfolio_value *= (1 + portfolio_return)
126            self.portfolio_values.append(self.portfolio_value)
127            
128            # Calculate reward (risk-adjusted return)
129            reward = self._calculate_reward(portfolio_return, transaction_costs)
130        else:
131            reward = 0
132        
133        # Check if done
134        done = (
135            self.current_step >= len(self.price_data) - 1 or
136            self.portfolio_value < self.initial_capital * 0.5  # Stop loss
137        )
138        
139        next_state = self._get_state() if not done else np.zeros_like(self._get_state())
140        
141        return next_state, reward, done
142    
143    def _decode_action(self, action: int) -> np.ndarray:
144        """
145        Decode discrete action to portfolio weights
146        
147        Simple discretization: equal-weight, overweight asset i, underweight asset i
148        """
149        weights = np.zeros(self.num_assets)
150        
151        if action == 0:
152            # All cash
153            pass
154        elif action <= self.num_assets:
155            # Equal weight
156            weights = np.ones(self.num_assets) / self.num_assets
157        elif action <= 2 * self.num_assets:
158            # Overweight asset
159            asset_idx = action - self.num_assets - 1
160            weights[asset_idx] = 0.5
161            weights = weights / np.sum(weights)
162        else:
163            # Long-short (if allowed)
164            pass
165        
166        return weights
167    
168    def _calculate_reward(
169        self,
170        portfolio_return: float,
171        transaction_costs: float
172    ) -> float:
173        """
174        Calculate reward (Sharpe-like objective)
175        
176        reward = return - λ * risk - transaction_costs
177        """
178        # Risk penalty
179        recent_returns = np.array(self.portfolio_values[-20:])
180        if len(recent_returns) > 1:
181            recent_returns = np.diff(recent_returns) / recent_returns[:-1]
182            volatility = np.std(recent_returns)
183        else:
184            volatility = 0
185        
186        risk_aversion = 0.5
187        reward = portfolio_return - risk_aversion * volatility - transaction_costs
188        
189        return reward * 100  # Scale for training stability
190
191class DQNNetwork(nn.Module):
192    """Deep Q-Network for portfolio allocation"""
193    
194    def __init__(self, state_dim: int, num_actions: int):
195        super().__init__()
196        
197        self.network = nn.Sequential(
198            nn.Linear(state_dim, 256),
199            nn.ReLU(),
200            nn.Dropout(0.2),
201            nn.Linear(256, 256),
202            nn.ReLU(),
203            nn.Dropout(0.2),
204            nn.Linear(256, 128),
205            nn.ReLU(),
206            nn.Linear(128, num_actions)
207        )
208        
209    def forward(self, state: torch.Tensor) -> torch.Tensor:
210        return self.network(state)
211
212class DQNAgent:
213    """DQN agent for portfolio management"""
214    
215    def __init__(
216        self,
217        state_dim: int,
218        num_actions: int,
219        learning_rate: float = 0.001,
220        gamma: float = 0.99,
221        epsilon_start: float = 1.0,
222        epsilon_end: float = 0.01,
223        epsilon_decay: float = 0.995,
224        memory_size: int = 10000,
225        batch_size: int = 64
226    ):
227        self.num_actions = num_actions
228        self.gamma = gamma
229        self.epsilon = epsilon_start
230        self.epsilon_end = epsilon_end
231        self.epsilon_decay = epsilon_decay
232        self.batch_size = batch_size
233        
234        # Networks
235        self.q_network = DQNNetwork(state_dim, num_actions)
236        self.target_network = DQNNetwork(state_dim, num_actions)
237        self.target_network.load_state_dict(self.q_network.state_dict())
238        
239        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
240        self.memory = deque(maxlen=memory_size)
241        
242    def select_action(self, state: np.ndarray, training: bool = True) -> int:
243        """Epsilon-greedy action selection"""
244        if training and random.random() < self.epsilon:
245            return random.randint(0, self.num_actions - 1)
246        
247        with torch.no_grad():
248            state_tensor = torch.FloatTensor(state).unsqueeze(0)
249            q_values = self.q_network(state_tensor)
250            return q_values.argmax().item()
251    
252    def store_transition(
253        self,
254        state: np.ndarray,
255        action: int,
256        reward: float,
257        next_state: np.ndarray,
258        done: bool
259    ):
260        """Store experience in replay memory"""
261        self.memory.append((state, action, reward, next_state, done))
262    
263    def train_step(self):
264        """Perform one training step"""
265        if len(self.memory) < self.batch_size:
266            return
267        
268        # Sample batch
269        batch = random.sample(self.memory, self.batch_size)
270        states, actions, rewards, next_states, dones = zip(*batch)
271        
272        states = torch.FloatTensor(np.array(states))
273        actions = torch.LongTensor(actions)
274        rewards = torch.FloatTensor(rewards)
275        next_states = torch.FloatTensor(np.array(next_states))
276        dones = torch.FloatTensor(dones)
277        
278        # Current Q values
279        current_q_values = self.q_network(states).gather(1, actions.unsqueeze(1))
280        
281        # Target Q values
282        with torch.no_grad():
283            next_q_values = self.target_network(next_states).max(1)[0]
284            target_q_values = rewards + (1 - dones) * self.gamma * next_q_values
285        
286        # Loss
287        loss = nn.MSELoss()(current_q_values.squeeze(), target_q_values)
288        
289        # Optimize
290        self.optimizer.zero_grad()
291        loss.backward()
292        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), 1.0)
293        self.optimizer.step()
294        
295        # Decay epsilon
296        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
297        
298        return loss.item()
299    
300    def update_target_network(self):
301        """Copy weights from Q-network to target network"""
302        self.target_network.load_state_dict(self.q_network.state_dict())
303
304# Training loop
305def train_dqn_portfolio(
306    price_data: np.ndarray,
307    num_episodes: int = 1000,
308    update_target_every: int = 10
309):
310    """Train DQN agent for portfolio management"""
311    
312    env = PortfolioEnvironment(price_data)
313    state_dim = len(env._get_state())
314    num_actions = env.num_assets * 2 + 1  # Simplified action space
315    
316    agent = DQNAgent(state_dim, num_actions)
317    
318    episode_returns = []
319    
320    for episode in range(num_episodes):
321        state = env.reset()
322        episode_reward = 0
323        done = False
324        
325        while not done:
326            # Select and perform action
327            action = agent.select_action(state)
328            next_state, reward, done = env.step(action)
329            
330            # Store transition
331            agent.store_transition(state, action, reward, next_state, done)
332            
333            # Train
334            loss = agent.train_step()
335            
336            episode_reward += reward
337            state = next_state
338        
339        # Update target network
340        if episode % update_target_every == 0:
341            agent.update_target_network()
342        
343        # Record performance
344        final_value = env.portfolio_value
345        total_return = (final_value / env.initial_capital - 1) * 100
346        episode_returns.append(total_return)
347        
348        if episode % 10 == 0:
349            print(f"Episode {episode}, Return: {total_return:.2f}%, "
350                  f"Epsilon: {agent.epsilon:.3f}")
351    
352    return agent, episode_returns
353
354# Example usage
355if __name__ == '__main__':
356    # Generate synthetic price data
357    np.random.seed(42)
358    num_timesteps = 1000
359    num_assets = 5
360    
361    # Random walk with drift
362    returns = np.random.normal(0.0001, 0.02, (num_timesteps, num_assets))
363    prices = 100 * np.exp(np.cumsum(returns, axis=0))
364    
365    # Train agent
366    agent, returns_history = train_dqn_portfolio(prices, num_episodes=100)
367    
368    print(f"\nFinal average return: {np.mean(returns_history[-10:]):.2f}%")
369

Policy Gradient Methods (PPO)#

Direct policy optimization for continuous actions.

python

1# ppo_portfolio.py
2import torch
3import torch.nn as nn
4import torch.optim as optim
5from torch.distributions import Normal
6import numpy as np
7
8class ActorCritic(nn.Module):
9    """Actor-Critic network for PPO"""
10    
11    def __init__(self, state_dim: int, action_dim: int):
12        super().__init__()
13        
14        # Shared layers
15        self.shared = nn.Sequential(
16            nn.Linear(state_dim, 256),
17            nn.ReLU(),
18            nn.Linear(256, 256),
19            nn.ReLU()
20        )
21        
22        # Actor head (policy)
23        self.actor_mean = nn.Linear(256, action_dim)
24        self.actor_logstd = nn.Parameter(torch.zeros(action_dim))
25        
26        # Critic head (value function)
27        self.critic = nn.Linear(256, 1)
28        
29    def forward(self, state):
30        shared_features = self.shared(state)
31        return shared_features
32    
33    def get_action_and_value(self, state, action=None):
34        """Get action distribution and value estimate"""
35        features = self.forward(state)
36        
37        # Actor
38        action_mean = torch.tanh(self.actor_mean(features))  # [-1, 1]
39        action_std = torch.exp(self.actor_logstd)
40        action_dist = Normal(action_mean, action_std)
41        
42        if action is None:
43            action = action_dist.sample()
44        
45        log_prob = action_dist.log_prob(action).sum(axis=-1)
46        entropy = action_dist.entropy().sum(axis=-1)
47        
48        # Critic
49        value = self.critic(features)
50        
51        return action, log_prob, entropy, value
52
53class PPOAgent:
54    """PPO agent for continuous portfolio allocation"""
55    
56    def __init__(
57        self,
58        state_dim: int,
59        action_dim: int,
60        learning_rate: float = 3e-4,
61        gamma: float = 0.99,
62        gae_lambda: float = 0.95,
63        clip_epsilon: float = 0.2,
64        epochs: int = 10,
65        batch_size: int = 64
66    ):
67        self.gamma = gamma
68        self.gae_lambda = gae_lambda
69        self.clip_epsilon = clip_epsilon
70        self.epochs = epochs
71        self.batch_size = batch_size
72        
73        self.model = ActorCritic(state_dim, action_dim)
74        self.optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)
75        
76    def select_action(self, state: np.ndarray):
77        """Select action from policy"""
78        state_tensor = torch.FloatTensor(state).unsqueeze(0)
79        
80        with torch.no_grad():
81            action, log_prob, _, value = self.model.get_action_and_value(state_tensor)
82        
83        return action.cpu().numpy()[0], log_prob.cpu().item(), value.cpu().item()
84    
85    def compute_gae(
86        self,
87        rewards: list,
88        values: list,
89        dones: list
90    ) -> tuple:
91        """Compute Generalized Advantage Estimation"""
92        advantages = []
93        returns = []
94        
95        advantage = 0
96        next_value = 0
97        
98        for t in reversed(range(len(rewards))):
99            if t == len(rewards) - 1:
100                next_value = 0
101                next_done = 1.0 if dones[t] else 0.0
102            else:
103                next_value = values[t + 1]
104                next_done = 1.0 if dones[t] else 0.0
105            
106            delta = rewards[t] + self.gamma * next_value * (1 - next_done) - values[t]
107            advantage = delta + self.gamma * self.gae_lambda * (1 - next_done) * advantage
108            
109            advantages.insert(0, advantage)
110            returns.insert(0, advantage + values[t])
111        
112        return advantages, returns
113    
114    def update(self, trajectories: dict):
115        """Update policy using PPO"""
116        states = torch.FloatTensor(trajectories['states'])
117        actions = torch.FloatTensor(trajectories['actions'])
118        old_log_probs = torch.FloatTensor(trajectories['log_probs'])
119        advantages = torch.FloatTensor(trajectories['advantages'])
120        returns = torch.FloatTensor(trajectories['returns'])
121        
122        # Normalize advantages
123        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
124        
125        # PPO epochs
126        for _ in range(self.epochs):
127            # Forward pass
128            _, log_probs, entropy, values = self.model.get_action_and_value(states, actions)
129            
130            # Policy loss (PPO clip objective)
131            ratio = torch.exp(log_probs - old_log_probs)
132            surr1 = ratio * advantages
133            surr2 = torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon) * advantages
134            policy_loss = -torch.min(surr1, surr2).mean()
135            
136            # Value loss
137            value_loss = nn.MSELoss()(values.squeeze(), returns)
138            
139            # Entropy bonus (encourage exploration)
140            entropy_loss = -entropy.mean()
141            
142            # Total loss
143            loss = policy_loss + 0.5 * value_loss + 0.01 * entropy_loss
144            
145            # Optimize
146            self.optimizer.zero_grad()
147            loss.backward()
148            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 0.5)
149            self.optimizer.step()
150        
151        return loss.item()
152

Production Metrics #

Our RL portfolio system (2024):

Performance #

plaintext

1Backtest (2020-2023):
2- Total return: 74.2%
3- Annual return: 18.4%
4- Sharpe ratio: 0.89
5- Sortino ratio: 1.24
6- Max drawdown: -12.7%
7- Win rate: 54.2%
8
9Live Trading (2024, 9 months):
10- Return: 12.8% (annualized: 17.1%)
11- Sharpe: 0.84
12- Avg daily return: 0.05%
13- Volatility: 14.2% annualized
14- Correlation to S&P 500: 0.42
15

vs Benchmarks #

plaintext

1Strategy Comparison (3-year backtest):
2
3Method              Return    Sharpe   Max DD
4----------------------------------------------
5RL (PPO)            74.2%     0.89     -12.7%
6RL (DQN)            68.4%     0.82     -15.3%
7Mean-Variance       52.1%     0.64     -18.9%
8Equal Weight        48.7%     0.58     -21.2%
9Buy & Hold          41.3%     0.51     -24.6%
10
11RL outperforms traditional methods by 22% return
12

Training Efficiency #

plaintext

1Training Time:
2- DQN: 8 hours (1000 episodes)
3- PPO: 12 hours (500 episodes)
4- Hardware: NVIDIA RTX 3090
5
6Sample Efficiency:
7- DQN: Converges after ~400 episodes
8- PPO: Converges after ~200 episodes
9- Stable performance after convergence
10
11Hyperparameter Sensitivity:
12- Learning rate: High (test 1e-5 to 1e-3)
13- Gamma: Medium (0.95-0.99)
14- Reward shaping: Critical
15

Reward Shaping Best Practices #

Critical for financial RL:

1. Risk-Adjusted Returns #

python

1# Sharpe-like reward
2def sharpe_reward(returns, risk_free_rate=0.02):
3    if len(returns) < 2:
4        return 0
5    mean_return = np.mean(returns)
6    std_return = np.std(returns) + 1e-8
7    return (mean_return - risk_free_rate/252) / std_return
8

2. Drawdown Penalty #

python

1# Penalize drawdowns heavily
2def drawdown_penalty(portfolio_values, max_dd_threshold=0.15):
3    peak = np.maximum.accumulate(portfolio_values)
4    drawdown = (peak - portfolio_values) / peak
5    penalty = np.where(
6        drawdown > max_dd_threshold,
7        -10 * drawdown,  # Heavy penalty
8        0
9    )
10    return penalty[-1]
11

3. Transaction Cost Integration #

python

1# Include realistic costs
2def transaction_cost_reward(
3    old_weights,
4    new_weights,
5    cost_bps=10
6):
7    turnover = np.sum(np.abs(new_weights - old_weights))
8    cost = turnover * (cost_bps / 10000)
9    return -cost
10

4. Diversification Bonus #

python

1# Reward diversification
2def diversification_reward(weights):
3    # Herfindahl index (lower = more diversified)
4    herfindahl = np.sum(weights ** 2)
5    # Bonus for diversification
6    return -herfindahl  # Maximize negative (minimize concentration)
7

Lessons Learned #

After 3 years deploying RL for portfolios:

Reward engineering critical: 80% of effort, determines success
PPO more stable than DQN: Better for continuous actions
Offline RL underrated: Learn from historical data first
Transaction costs matter: Include in reward immediately
Risk management essential: Hard constraints + reward penalties
Sample efficiency poor: Needs lots of data (years of ticks)
Hyperparameter sensitive: Extensive tuning required
Overfitting real risk: Use walk-forward validation
Interpretability lacking: Black box decisions concerning
Maintenance intensive: Retrain monthly minimum

RL works but requires significant engineering effort.

Reinforcement Learning for Portfolio Management

Why Reinforcement Learning #

Deep Q-Network (DQN) for Allocation #

Policy Gradient Methods (PPO)#

Production Metrics #

Performance #

vs Benchmarks #

Training Efficiency #

Reward Shaping Best Practices #

1. Risk-Adjusted Returns #

2. Drawdown Penalty #

3. Transaction Cost Integration #

4. Diversification Bonus #

Lessons Learned #

Further Reading #

NordVarg Team

Join 1,000+ Engineers

Related Posts

Reinforcement Learning for Portfolio Management

Why Reinforcement Learning #

Deep Q-Network (DQN) for Allocation #

Policy Gradient Methods (PPO)#

Production Metrics #

Performance #

vs Benchmarks #

Training Efficiency #

Reward Shaping Best Practices #

1. Risk-Adjusted Returns #

2. Drawdown Penalty #

3. Transaction Cost Integration #

4. Diversification Bonus #

Lessons Learned #

Further Reading #

NordVarg Team

Join 1,000+ Engineers

Related Posts