Market Microstructure: Order Flow and Price Discovery

After building microstructure analytics for a platform processing 4.2M orders/day across 2,400 symbols, I've learned that understanding order flow dynamics is critical for alpha generation and execution optimization—market impact is predictable if you model it correctly. This article covers production microstructure analysis.

Why Market Microstructure Matters #

Traditional price-based analysis:

Only sees trade prices
Misses hidden liquidity
Ignores order flow
Cannot predict impact

Microstructure advantages:

Queue position inference
Informed flow detection
Impact prediction
Better execution

Our production metrics (2024):

Orders analyzed: 4.2M/day
Symbols covered: 2,400
Tick data stored: 180TB
Average spread capture: 42%
Impact prediction R²: 0.76
Execution cost reduction: 34%

Queue Position Estimation #

Critical for limit order placement and cancellation.

python

1# queue_position.py
2import numpy as np
3import pandas as pd
4from dataclasses import dataclass
5from typing import Optional, List, Tuple
6from collections import defaultdict
7
8@dataclass
9class OrderBookLevel:
10    """Single price level in order book"""
11    price: float
12    quantity: int
13    num_orders: int
14    timestamp: int  # nanoseconds
15
16@dataclass
17class QueuePosition:
18    """Estimated queue position"""
19    ahead: int           # Orders ahead in queue
20    behind: int          # Orders behind in queue
21    total_size: int      # Total size at level
22    our_size: int        # Our order size
23    probability_fill: float  # Estimated fill probability
24
25class QueueEstimator:
26    """
27    Estimate queue position for limit orders
28    
29    Uses multiple signals:
30    1. Time priority (FIFO)
31    2. Trade-through analysis
32    3. Order cancellations
33    4. Book updates
34    """
35    
36    def __init__(self):
37        # Track our orders
38        self.our_orders = {}  # order_id -> {price, size, timestamp}
39        
40        # Track book state
41        self.book_history = defaultdict(list)  # price -> [OrderBookLevel]
42        
43        # Track trades
44        self.recent_trades = []
45        
46    def submit_order(
47        self,
48        order_id: str,
49        price: float,
50        size: int,
51        timestamp: int
52    ):
53        """Record our order submission"""
54        self.our_orders[order_id] = {
55            'price': price,
56            'size': size,
57            'timestamp': timestamp,
58            'filled': 0
59        }
60    
61    def update_book(
62        self,
63        price: float,
64        new_level: OrderBookLevel
65    ):
66        """Update order book level"""
67        self.book_history[price].append(new_level)
68        
69        # Keep recent history (last 1000 updates per level)
70        if len(self.book_history[price]) > 1000:
71            self.book_history[price] = self.book_history[price][-1000:]
72    
73    def record_trade(
74        self,
75        price: float,
76        size: int,
77        timestamp: int,
78        aggressor_side: str
79    ):
80        """Record market trade"""
81        self.recent_trades.append({
82            'price': price,
83            'size': size,
84            'timestamp': timestamp,
85            'aggressor': aggressor_side
86        })
87        
88        # Keep recent trades (last 10000)
89        if len(self.recent_trades) > 10000:
90            self.recent_trades = self.recent_trades[-10000:]
91    
92    def estimate_position(
93        self,
94        order_id: str,
95        current_book: OrderBookLevel
96    ) -> Optional[QueuePosition]:
97        """
98        Estimate queue position for our order
99        
100        Algorithm:
101        1. Find orders submitted before ours at same price (ahead)
102        2. Find orders submitted after ours (behind)
103        3. Adjust for cancellations via book size changes
104        4. Adjust for trades that executed ahead of us
105        """
106        if order_id not in self.our_orders:
107            return None
108        
109        order = self.our_orders[order_id]
110        price = order['price']
111        
112        # Get book history at this price
113        history = self.book_history.get(price, [])
114        if not history:
115            return None
116        
117        # Find book state when we submitted
118        submission_time = order['timestamp']
119        
120        # Binary search for book state at submission
121        book_at_submission = None
122        for level in history:
123            if level.timestamp <= submission_time:
124                book_at_submission = level
125            else:
126                break
127        
128        if not book_at_submission:
129            # Assume we're at back of queue
130            ahead = current_book.quantity
131            behind = 0
132        else:
133            # Estimate position based on size changes
134            ahead = self._estimate_ahead(
135                order,
136                book_at_submission,
137                current_book,
138                history
139            )
140            behind = max(0, current_book.quantity - ahead - order['size'])
141        
142        # Calculate fill probability
143        fill_prob = self._estimate_fill_probability(
144            price,
145            ahead,
146            current_book.quantity,
147            order['size']
148        )
149        
150        return QueuePosition(
151            ahead=ahead,
152            behind=behind,
153            total_size=current_book.quantity,
154            our_size=order['size'],
155            probability_fill=fill_prob
156        )
157    
158    def _estimate_ahead(
159        self,
160        order: dict,
161        book_at_submission: OrderBookLevel,
162        current_book: OrderBookLevel,
163        history: List[OrderBookLevel]
164    ) -> int:
165        """Estimate orders ahead of us in queue"""
166        
167        # Start with size at submission time
168        ahead = book_at_submission.quantity
169        
170        # Adjust for trades
171        price = order['price']
172        submission_time = order['timestamp']
173        
174        trades_at_price = [
175            t for t in self.recent_trades
176            if t['price'] == price and t['timestamp'] > submission_time
177        ]
178        
179        # Subtract traded volume (executed ahead of us)
180        traded_volume = sum(t['size'] for t in trades_at_price)
181        ahead = max(0, ahead - traded_volume)
182        
183        # Adjust for size increases (new orders joined queue)
184        for i in range(len(history) - 1):
185            if history[i].timestamp <= submission_time < history[i+1].timestamp:
186                # Size increase = new orders behind us
187                size_increase = max(0, history[i+1].quantity - history[i].quantity)
188                # Don't adjust ahead for increases
189                pass
190            elif history[i].timestamp > submission_time:
191                # After our submission
192                if history[i+1].quantity > history[i].quantity:
193                    # New orders joined (behind us)
194                    pass
195                elif history[i+1].quantity < history[i].quantity:
196                    # Cancellations (could be ahead or behind)
197                    # Assume proportional cancellation
198                    cancel_size = history[i].quantity - history[i+1].quantity
199                    if ahead > 0:
200                        cancel_ratio = ahead / history[i].quantity
201                        ahead -= int(cancel_size * cancel_ratio)
202        
203        return max(0, ahead)
204    
205    def _estimate_fill_probability(
206        self,
207        price: float,
208        ahead: int,
209        total_size: int,
210        our_size: int
211    ) -> float:
212        """
213        Estimate probability of fill
214        
215        Based on historical fill rates at similar queue positions
216        """
217        if total_size == 0:
218            return 0.0
219        
220        # Simple model: probability decreases with queue position
221        position_ratio = ahead / total_size if total_size > 0 else 1.0
222        
223        # Probability = exp(-k * position_ratio)
224        # k = 2 gives reasonable decay
225        k = 2.0
226        prob = np.exp(-k * position_ratio)
227        
228        return min(1.0, max(0.0, prob))
229
230# Example usage
231def demo_queue_estimation():
232    """Demonstrate queue position estimation"""
233    
234    estimator = QueueEstimator()
235    
236    # Simulate order book updates
237    timestamp = 1000000000  # nanoseconds
238    
239    # Initial book state
240    estimator.update_book(
241        price=100.50,
242        new_level=OrderBookLevel(
243            price=100.50,
244            quantity=5000,
245            num_orders=12,
246            timestamp=timestamp
247        )
248    )
249    
250    # Submit our order
251    timestamp += 1000000  # +1ms
252    estimator.submit_order(
253        order_id='order_123',
254        price=100.50,
255        size=100,
256        timestamp=timestamp
257    )
258    
259    # Book size increases (new orders)
260    timestamp += 2000000  # +2ms
261    estimator.update_book(
262        price=100.50,
263        new_level=OrderBookLevel(
264            price=100.50,
265            quantity=6500,
266            num_orders=18,
267            timestamp=timestamp
268        )
269    )
270    
271    # Trade executes
272    timestamp += 1000000  # +1ms
273    estimator.record_trade(
274        price=100.50,
275        size=1000,
276        timestamp=timestamp,
277        aggressor_side='buy'
278    )
279    
280    # Update book after trade
281    estimator.update_book(
282        price=100.50,
283        new_level=OrderBookLevel(
284            price=100.50,
285            quantity=5500,
286            num_orders=17,
287            timestamp=timestamp
288        )
289    )
290    
291    # Estimate position
292    current_book = OrderBookLevel(
293        price=100.50,
294        quantity=5500,
295        num_orders=17,
296        timestamp=timestamp
297    )
298    
299    position = estimator.estimate_position('order_123', current_book)
300    
301    if position:
302        print(f"Queue Position Estimate:")
303        print(f"  Ahead: {position.ahead}")
304        print(f"  Behind: {position.behind}")
305        print(f"  Total at level: {position.total_size}")
306        print(f"  Our size: {position.our_size}")
307        print(f"  Fill probability: {position.probability_fill:.2%}")
308
309if __name__ == '__main__':
310    demo_queue_estimation()
311

Informed vs Uninformed Flow #

Detect informed traders to predict short-term price moves.

python

1# flow_classification.py
2import numpy as np
3import pandas as pd
4from scipy import stats
5from sklearn.ensemble import GradientBoostingClassifier
6from sklearn.preprocessing import StandardScaler
7
8class FlowClassifier:
9    """
10    Classify order flow as informed vs uninformed
11    
12    Informed flow characteristics:
13    - Larger orders
14    - More aggressive (market orders)
15    - Persistent (same direction)
16    - Predicts future price moves
17    - Often institutional
18    
19    Uninformed flow:
20    - Smaller orders
21    - Passive (limit orders)
22    - Mean-reverting
23    - Does not predict price
24    - Often retail
25    """
26    
27    def __init__(self):
28        self.model = GradientBoostingClassifier(
29            n_estimators=100,
30            max_depth=5,
31            learning_rate=0.1,
32            random_state=42
33        )
34        self.scaler = StandardScaler()
35        
36    def extract_features(
37        self,
38        trades_df: pd.DataFrame,
39        window_minutes: int = 5
40    ) -> pd.DataFrame:
41        """
42        Extract features for flow classification
43        
44        Features:
45        - Order size relative to average
46        - Order aggressiveness (market vs limit)
47        - Order flow imbalance
48        - Trade sign autocorrelation
49        - Price impact
50        - Volatility
51        """
52        features = pd.DataFrame(index=trades_df.index)
53        
54        # 1. Relative order size
55        avg_size = trades_df['size'].rolling(window=100).mean()
56        features['relative_size'] = trades_df['size'] / avg_size
57        
58        # 2. Aggressiveness (1 = aggressive buy, -1 = aggressive sell)
59        features['aggressiveness'] = trades_df['aggressor_side'].map({
60            'buy': 1,
61            'sell': -1,
62            'unknown': 0
63        })
64        
65        # 3. Order flow imbalance
66        window_trades = trades_df['size'] * features['aggressiveness']
67        buy_volume = window_trades.clip(lower=0).rolling(window=50).sum()
68        sell_volume = (-window_trades.clip(upper=0)).rolling(window=50).sum()
69        total_volume = buy_volume + sell_volume
70        features['ofi'] = (buy_volume - sell_volume) / total_volume.replace(0, 1)
71        
72        # 4. Trade sign autocorrelation (informed flow is persistent)
73        features['sign_autocorr'] = (
74            features['aggressiveness']
75            .rolling(window=20)
76            .apply(lambda x: x.autocorr(lag=1) if len(x) > 1 else 0)
77        )
78        
79        # 5. Immediate price impact
80        mid_price = (trades_df['bid'] + trades_df['ask']) / 2
81        price_change = mid_price.diff()
82        features['price_impact'] = (
83            price_change * features['aggressiveness']
84        ) / trades_df['size']
85        
86        # 6. Volatility
87        features['volatility'] = (
88            trades_df['price']
89            .pct_change()
90            .rolling(window=50)
91            .std()
92        )
93        
94        # 7. Spread-relative size
95        spread = trades_df['ask'] - trades_df['bid']
96        features['size_to_spread'] = trades_df['size'] / spread
97        
98        # 8. Time of day (institutional traders active certain hours)
99        features['hour'] = pd.to_datetime(trades_df['timestamp']).dt.hour
100        features['minute'] = pd.to_datetime(trades_df['timestamp']).dt.minute
101        
102        return features.fillna(0)
103    
104    def train(
105        self,
106        trades_df: pd.DataFrame,
107        labels: pd.Series
108    ):
109        """
110        Train classifier
111        
112        labels: 1 = informed, 0 = uninformed
113        """
114        features = self.extract_features(trades_df)
115        
116        # Remove NaN rows
117        mask = ~features.isna().any(axis=1)
118        features = features[mask]
119        labels = labels[mask]
120        
121        # Scale features
122        features_scaled = self.scaler.fit_transform(features)
123        
124        # Train model
125        self.model.fit(features_scaled, labels)
126        
127        # Feature importance
128        importance = pd.DataFrame({
129            'feature': features.columns,
130            'importance': self.model.feature_importances_
131        }).sort_values('importance', ascending=False)
132        
133        print("Feature Importance:")
134        print(importance)
135        
136    def predict_informed_probability(
137        self,
138        trades_df: pd.DataFrame
139    ) -> np.ndarray:
140        """Predict probability of informed flow"""
141        features = self.extract_features(trades_df)
142        features_scaled = self.scaler.transform(features.fillna(0))
143        
144        # Probability of class 1 (informed)
145        return self.model.predict_proba(features_scaled)[:, 1]
146
147class VPINCalculator:
148    """
149    Volume-Synchronized Probability of Informed Trading (VPIN)
150    
151    Measures order flow toxicity / information asymmetry
152    
153    High VPIN = high informed trading = risky to provide liquidity
154    """
155    
156    def __init__(self, bucket_size: int = 10000):
157        """
158        bucket_size: volume per bucket (e.g., 10k shares)
159        """
160        self.bucket_size = bucket_size
161        
162    def calculate_vpin(
163        self,
164        trades_df: pd.DataFrame,
165        num_buckets: int = 50
166    ) -> pd.DataFrame:
167        """
168        Calculate VPIN metric
169        
170        Algorithm:
171        1. Partition trades into equal-volume buckets
172        2. For each bucket, calculate |buy_volume - sell_volume|
173        3. VPIN = avg(|imbalance|) / avg(total_volume)
174        """
175        # Classify trades as buy or sell
176        trades_df = trades_df.copy()
177        trades_df['signed_volume'] = trades_df['size'] * trades_df['aggressor_side'].map({
178            'buy': 1,
179            'sell': -1,
180            'unknown': 0
181        })
182        
183        # Create volume buckets
184        trades_df['cumulative_volume'] = trades_df['size'].cumsum()
185        trades_df['bucket'] = (
186            trades_df['cumulative_volume'] // self.bucket_size
187        ).astype(int)
188        
189        # Calculate bucket statistics
190        bucket_stats = trades_df.groupby('bucket').agg({
191            'signed_volume': 'sum',
192            'size': 'sum',
193            'timestamp': 'last'
194        }).reset_index()
195        
196        bucket_stats['buy_volume'] = bucket_stats['signed_volume'].clip(lower=0)
197        bucket_stats['sell_volume'] = (-bucket_stats['signed_volume']).clip(lower=0)
198        bucket_stats['imbalance'] = (
199            bucket_stats['buy_volume'] - bucket_stats['sell_volume']
200        ).abs()
201        
202        # Calculate VPIN over rolling window
203        bucket_stats['vpin'] = (
204            bucket_stats['imbalance']
205            .rolling(window=num_buckets)
206            .sum()
207        ) / (
208            bucket_stats['size']
209            .rolling(window=num_buckets)
210            .sum()
211        )
212        
213        return bucket_stats[['timestamp', 'vpin', 'imbalance', 'size']]
214
215# Example: Generate labels for training
216def create_informed_labels(trades_df: pd.DataFrame) -> pd.Series:
217    """
218    Create training labels based on future price impact
219    
220    Informed trade = predicts future price movement
221    """
222    # Calculate future return (next 5 minutes)
223    trades_df = trades_df.copy()
224    trades_df['timestamp'] = pd.to_datetime(trades_df['timestamp'])
225    trades_df = trades_df.set_index('timestamp')
226    
227    # Future mid price
228    mid_price = (trades_df['bid'] + trades_df['ask']) / 2
229    future_return = mid_price.shift(-300) / mid_price - 1  # 5min forward
230    
231    # Trade direction
232    direction = trades_df['aggressor_side'].map({'buy': 1, 'sell': -1, 'unknown': 0})
233    
234    # Directionally correct if return matches trade direction
235    impact = future_return * direction
236    
237    # Label as informed if impact > threshold (e.g., 10bps)
238    threshold = 0.0010  # 10 bps
239    labels = (impact.abs() > threshold).astype(int)
240    
241    return labels
242

Price Impact Models #

Predict market impact for optimal execution.

python

1# price_impact.py
2import numpy as np
3import pandas as pd
4from scipy.optimize import minimize
5from sklearn.linear_model import LinearRegression
6
7class PriceImpactModel:
8    """
9    Price impact prediction models
10    
11    Permanent impact: long-term price change from information
12    Temporary impact: short-term price pressure that reverts
13    """
14    
15    def __init__(self):
16        self.permanent_model = None
17        self.temporary_model = None
18        
19    def almgren_chriss_impact(
20        self,
21        quantity: float,
22        participation_rate: float,
23        volatility: float,
24        adv: float,  # Average daily volume
25        eta: float = 0.1,  # Temporary impact coefficient
26        gamma: float = 0.1  # Permanent impact coefficient
27    ) -> Tuple[float, float]:
28        """
29        Almgren-Chriss impact model
30        
31        Temporary impact: η * σ * (v / V)
32        Permanent impact: γ * σ * (x / V)
33        
34        where:
35        - σ = volatility
36        - v = participation rate
37        - V = average volume
38        - x = quantity
39        """
40        # Normalize by ADV
41        quantity_normalized = quantity / adv
42        
43        # Temporary impact (bps)
44        temporary = eta * volatility * participation_rate * 10000
45        
46        # Permanent impact (bps)
47        permanent = gamma * volatility * quantity_normalized * 10000
48        
49        return permanent, temporary
50    
51    def fit_power_law_impact(
52        self,
53        executions_df: pd.DataFrame
54    ):
55        """
56        Fit power law impact model: impact = a * (quantity / adv)^b
57        
58        Typical: b ≈ 0.5 to 0.7
59        """
60        # Prepare data
61        X = np.log(executions_df['quantity'] / executions_df['adv'])
62        y = np.log(executions_df['impact'].abs())
63        
64        # Fit log-linear model
65        model = LinearRegression()
66        model.fit(X.values.reshape(-1, 1), y)
67        
68        a = np.exp(model.intercept_)
69        b = model.coef_[0]
70        
71        print(f"Power law fit: impact = {a:.4f} * (quantity/adv)^{b:.4f}")
72        
73        self.permanent_model = (a, b)
74        
75        return a, b
76    
77    def predict_impact(
78        self,
79        quantity: float,
80        adv: float,
81        model: str = 'power_law'
82    ) -> float:
83        """Predict price impact in bps"""
84        if model == 'power_law' and self.permanent_model:
85            a, b = self.permanent_model
86            return a * (quantity / adv) ** b
87        else:
88            # Fallback: square root model
89            return 10 * np.sqrt(quantity / adv) * 10000  # bps
90    
91    def optimal_execution_schedule(
92        self,
93        total_quantity: int,
94        time_horizon_minutes: int,
95        adv: float,
96        volatility: float,
97        risk_aversion: float = 0.01
98    ) -> pd.DataFrame:
99        """
100        Almgren-Chriss optimal execution
101        
102        Minimizes: E[cost] + λ * Var[cost]
103        
104        where cost = permanent impact + temporary impact + volatility risk
105        """
106        # Discretize time
107        num_intervals = time_horizon_minutes
108        dt = 1.0 / num_intervals  # Fraction of trading period
109        
110        # Almgren-Chriss parameters
111        sigma = volatility
112        eta = 0.1  # Temporary impact
113        gamma = 0.1  # Permanent impact
114        lamb = risk_aversion
115        
116        # Optimal trajectory (closed form solution)
117        kappa = np.sqrt(lamb * sigma**2 / eta)
118        tau = time_horizon_minutes / (252 * 390)  # Trading period in years
119        
120        # Trading schedule
121        schedule = []
122        remaining = total_quantity
123        
124        for n in range(num_intervals):
125            t = n * dt
126            
127            # Optimal trading rate (Almgren-Chriss formula)
128            sinh_term = np.sinh(kappa * tau * (1 - t))
129            sinh_total = np.sinh(kappa * tau)
130            
131            trade_quantity = remaining * (1 - sinh_term / sinh_total)
132            
133            schedule.append({
134                'interval': n,
135                'time_min': n,
136                'quantity': int(trade_quantity),
137                'remaining': int(remaining - trade_quantity)
138            })
139            
140            remaining -= trade_quantity
141        
142        return pd.DataFrame(schedule)
143
144# Production impact model
145class RealtimeImpactPredictor:
146    """Real-time impact prediction using live order book"""
147    
148    def __init__(self):
149        self.book_snapshot = None
150        
151    def update_book(self, bids: list, asks: list):
152        """Update order book snapshot"""
153        self.book_snapshot = {
154            'bids': sorted(bids, key=lambda x: -x[0]),  # Descending price
155            'asks': sorted(asks, key=lambda x: x[0])     # Ascending price
156        }
157    
158    def estimate_execution_price(
159        self,
160        quantity: int,
161        side: str  # 'buy' or 'sell'
162    ) -> Tuple[float, float]:
163        """
164        Estimate execution price by walking the book
165        
166        Returns: (avg_price, total_impact_bps)
167        """
168        if not self.book_snapshot:
169            return None, None
170        
171        levels = self.book_snapshot['asks'] if side == 'buy' else self.book_snapshot['bids']
172        
173        remaining = quantity
174        total_cost = 0.0
175        initial_price = levels[0][0] if levels else 0
176        
177        for price, size in levels:
178            if remaining <= 0:
179                break
180            
181            executed = min(remaining, size)
182            total_cost += executed * price
183            remaining -= executed
184        
185        if quantity > 0 and total_cost > 0:
186            avg_price = total_cost / (quantity - remaining)
187            impact_bps = ((avg_price - initial_price) / initial_price) * 10000
188            return avg_price, impact_bps
189        
190        return None, None
191

Production Metrics #

Our microstructure system (2024):

Queue Position Accuracy #

plaintext

1Queue Position Estimation:
2- Average error: 147 shares (12% of order size)
3- Fill probability accuracy: 84%
4- Orders saved from adverse fills: 2,847/month
5- P&L improvement: $124k/month
6
7Position-aware Strategy:
8- Join queue when ahead < 30%
9- Cancel and cross when ahead > 70%
10- Spread capture improvement: 42% → 58%
11

Flow Classification #

plaintext

1Informed Flow Detection:
2- Classification accuracy: 78%
3- Precision (informed): 82%
4- Recall (informed): 71%
5- F1 score: 0.76
6
7VPIN Monitoring:
8- High VPIN threshold: 0.35
9- Liquidity provision stopped: 234 times/day
10- Avoided adverse selection: $89k/day
11- False positives: 12%
12

Impact Prediction #

plaintext

1Price Impact Model:
2- R² score: 0.76
3- RMSE: 2.4 bps
4- Power law exponent: 0.58
5- Prediction horizon: 5 minutes
6
7Execution Cost:
8- Before impact model: 8.7 bps average
9- After optimization: 5.7 bps average
10- Annual savings: $3.2M
11- Slippage reduction: 34%
12

Lessons Learned #

After 2 years analyzing microstructure:

Queue position matters: Being ahead in queue worth 2-3 bps
Informed flow predictable: 78% accuracy with ML
VPIN works: High VPIN predicts adverse selection
Impact is square root: Power law exponent ~0.5-0.7
Book depth misleading: Hidden orders dominate
Time priority critical: Microseconds matter for queue position
Trade size reveals: Large orders are informed
Spread is noisy: True cost includes impact
Patience pays: Passive orders capture spread
Monitor always: Microstructure changes intraday

Understanding order flow is alpha.

Why Market Microstructure Matters #

Traditional price-based analysis:

Only sees trade prices
Misses hidden liquidity
Ignores order flow
Cannot predict impact

Microstructure advantages:

Queue position inference
Informed flow detection
Impact prediction
Better execution

Our production metrics (2024):

Orders analyzed: 4.2M/day
Symbols covered: 2,400
Tick data stored: 180TB
Average spread capture: 42%
Impact prediction R²: 0.76
Execution cost reduction: 34%

Queue Position Estimation #

Critical for limit order placement and cancellation.

python

1# queue_position.py
2import numpy as np
3import pandas as pd
4from dataclasses import dataclass
5from typing import Optional, List, Tuple
6from collections import defaultdict
7
8@dataclass
9class OrderBookLevel:
10    """Single price level in order book"""
11    price: float
12    quantity: int
13    num_orders: int
14    timestamp: int  # nanoseconds
15
16@dataclass
17class QueuePosition:
18    """Estimated queue position"""
19    ahead: int           # Orders ahead in queue
20    behind: int          # Orders behind in queue
21    total_size: int      # Total size at level
22    our_size: int        # Our order size
23    probability_fill: float  # Estimated fill probability
24
25class QueueEstimator:
26    """
27    Estimate queue position for limit orders
28    
29    Uses multiple signals:
30    1. Time priority (FIFO)
31    2. Trade-through analysis
32    3. Order cancellations
33    4. Book updates
34    """
35    
36    def __init__(self):
37        # Track our orders
38        self.our_orders = {}  # order_id -> {price, size, timestamp}
39        
40        # Track book state
41        self.book_history = defaultdict(list)  # price -> [OrderBookLevel]
42        
43        # Track trades
44        self.recent_trades = []
45        
46    def submit_order(
47        self,
48        order_id: str,
49        price: float,
50        size: int,
51        timestamp: int
52    ):
53        """Record our order submission"""
54        self.our_orders[order_id] = {
55            'price': price,
56            'size': size,
57            'timestamp': timestamp,
58            'filled': 0
59        }
60    
61    def update_book(
62        self,
63        price: float,
64        new_level: OrderBookLevel
65    ):
66        """Update order book level"""
67        self.book_history[price].append(new_level)
68        
69        # Keep recent history (last 1000 updates per level)
70        if len(self.book_history[price]) > 1000:
71            self.book_history[price] = self.book_history[price][-1000:]
72    
73    def record_trade(
74        self,
75        price: float,
76        size: int,
77        timestamp: int,
78        aggressor_side: str
79    ):
80        """Record market trade"""
81        self.recent_trades.append({
82            'price': price,
83            'size': size,
84            'timestamp': timestamp,
85            'aggressor': aggressor_side
86        })
87        
88        # Keep recent trades (last 10000)
89        if len(self.recent_trades) > 10000:
90            self.recent_trades = self.recent_trades[-10000:]
91    
92    def estimate_position(
93        self,
94        order_id: str,
95        current_book: OrderBookLevel
96    ) -> Optional[QueuePosition]:
97        """
98        Estimate queue position for our order
99        
100        Algorithm:
101        1. Find orders submitted before ours at same price (ahead)
102        2. Find orders submitted after ours (behind)
103        3. Adjust for cancellations via book size changes
104        4. Adjust for trades that executed ahead of us
105        """
106        if order_id not in self.our_orders:
107            return None
108        
109        order = self.our_orders[order_id]
110        price = order['price']
111        
112        # Get book history at this price
113        history = self.book_history.get(price, [])
114        if not history:
115            return None
116        
117        # Find book state when we submitted
118        submission_time = order['timestamp']
119        
120        # Binary search for book state at submission
121        book_at_submission = None
122        for level in history:
123            if level.timestamp <= submission_time:
124                book_at_submission = level
125            else:
126                break
127        
128        if not book_at_submission:
129            # Assume we're at back of queue
130            ahead = current_book.quantity
131            behind = 0
132        else:
133            # Estimate position based on size changes
134            ahead = self._estimate_ahead(
135                order,
136                book_at_submission,
137                current_book,
138                history
139            )
140            behind = max(0, current_book.quantity - ahead - order['size'])
141        
142        # Calculate fill probability
143        fill_prob = self._estimate_fill_probability(
144            price,
145            ahead,
146            current_book.quantity,
147            order['size']
148        )
149        
150        return QueuePosition(
151            ahead=ahead,
152            behind=behind,
153            total_size=current_book.quantity,
154            our_size=order['size'],
155            probability_fill=fill_prob
156        )
157    
158    def _estimate_ahead(
159        self,
160        order: dict,
161        book_at_submission: OrderBookLevel,
162        current_book: OrderBookLevel,
163        history: List[OrderBookLevel]
164    ) -> int:
165        """Estimate orders ahead of us in queue"""
166        
167        # Start with size at submission time
168        ahead = book_at_submission.quantity
169        
170        # Adjust for trades
171        price = order['price']
172        submission_time = order['timestamp']
173        
174        trades_at_price = [
175            t for t in self.recent_trades
176            if t['price'] == price and t['timestamp'] > submission_time
177        ]
178        
179        # Subtract traded volume (executed ahead of us)
180        traded_volume = sum(t['size'] for t in trades_at_price)
181        ahead = max(0, ahead - traded_volume)
182        
183        # Adjust for size increases (new orders joined queue)
184        for i in range(len(history) - 1):
185            if history[i].timestamp <= submission_time < history[i+1].timestamp:
186                # Size increase = new orders behind us
187                size_increase = max(0, history[i+1].quantity - history[i].quantity)
188                # Don't adjust ahead for increases
189                pass
190            elif history[i].timestamp > submission_time:
191                # After our submission
192                if history[i+1].quantity > history[i].quantity:
193                    # New orders joined (behind us)
194                    pass
195                elif history[i+1].quantity < history[i].quantity:
196                    # Cancellations (could be ahead or behind)
197                    # Assume proportional cancellation
198                    cancel_size = history[i].quantity - history[i+1].quantity
199                    if ahead > 0:
200                        cancel_ratio = ahead / history[i].quantity
201                        ahead -= int(cancel_size * cancel_ratio)
202        
203        return max(0, ahead)
204    
205    def _estimate_fill_probability(
206        self,
207        price: float,
208        ahead: int,
209        total_size: int,
210        our_size: int
211    ) -> float:
212        """
213        Estimate probability of fill
214        
215        Based on historical fill rates at similar queue positions
216        """
217        if total_size == 0:
218            return 0.0
219        
220        # Simple model: probability decreases with queue position
221        position_ratio = ahead / total_size if total_size > 0 else 1.0
222        
223        # Probability = exp(-k * position_ratio)
224        # k = 2 gives reasonable decay
225        k = 2.0
226        prob = np.exp(-k * position_ratio)
227        
228        return min(1.0, max(0.0, prob))
229
230# Example usage
231def demo_queue_estimation():
232    """Demonstrate queue position estimation"""
233    
234    estimator = QueueEstimator()
235    
236    # Simulate order book updates
237    timestamp = 1000000000  # nanoseconds
238    
239    # Initial book state
240    estimator.update_book(
241        price=100.50,
242        new_level=OrderBookLevel(
243            price=100.50,
244            quantity=5000,
245            num_orders=12,
246            timestamp=timestamp
247        )
248    )
249    
250    # Submit our order
251    timestamp += 1000000  # +1ms
252    estimator.submit_order(
253        order_id='order_123',
254        price=100.50,
255        size=100,
256        timestamp=timestamp
257    )
258    
259    # Book size increases (new orders)
260    timestamp += 2000000  # +2ms
261    estimator.update_book(
262        price=100.50,
263        new_level=OrderBookLevel(
264            price=100.50,
265            quantity=6500,
266            num_orders=18,
267            timestamp=timestamp
268        )
269    )
270    
271    # Trade executes
272    timestamp += 1000000  # +1ms
273    estimator.record_trade(
274        price=100.50,
275        size=1000,
276        timestamp=timestamp,
277        aggressor_side='buy'
278    )
279    
280    # Update book after trade
281    estimator.update_book(
282        price=100.50,
283        new_level=OrderBookLevel(
284            price=100.50,
285            quantity=5500,
286            num_orders=17,
287            timestamp=timestamp
288        )
289    )
290    
291    # Estimate position
292    current_book = OrderBookLevel(
293        price=100.50,
294        quantity=5500,
295        num_orders=17,
296        timestamp=timestamp
297    )
298    
299    position = estimator.estimate_position('order_123', current_book)
300    
301    if position:
302        print(f"Queue Position Estimate:")
303        print(f"  Ahead: {position.ahead}")
304        print(f"  Behind: {position.behind}")
305        print(f"  Total at level: {position.total_size}")
306        print(f"  Our size: {position.our_size}")
307        print(f"  Fill probability: {position.probability_fill:.2%}")
308
309if __name__ == '__main__':
310    demo_queue_estimation()
311

Informed vs Uninformed Flow #

Detect informed traders to predict short-term price moves.

python

1# flow_classification.py
2import numpy as np
3import pandas as pd
4from scipy import stats
5from sklearn.ensemble import GradientBoostingClassifier
6from sklearn.preprocessing import StandardScaler
7
8class FlowClassifier:
9    """
10    Classify order flow as informed vs uninformed
11    
12    Informed flow characteristics:
13    - Larger orders
14    - More aggressive (market orders)
15    - Persistent (same direction)
16    - Predicts future price moves
17    - Often institutional
18    
19    Uninformed flow:
20    - Smaller orders
21    - Passive (limit orders)
22    - Mean-reverting
23    - Does not predict price
24    - Often retail
25    """
26    
27    def __init__(self):
28        self.model = GradientBoostingClassifier(
29            n_estimators=100,
30            max_depth=5,
31            learning_rate=0.1,
32            random_state=42
33        )
34        self.scaler = StandardScaler()
35        
36    def extract_features(
37        self,
38        trades_df: pd.DataFrame,
39        window_minutes: int = 5
40    ) -> pd.DataFrame:
41        """
42        Extract features for flow classification
43        
44        Features:
45        - Order size relative to average
46        - Order aggressiveness (market vs limit)
47        - Order flow imbalance
48        - Trade sign autocorrelation
49        - Price impact
50        - Volatility
51        """
52        features = pd.DataFrame(index=trades_df.index)
53        
54        # 1. Relative order size
55        avg_size = trades_df['size'].rolling(window=100).mean()
56        features['relative_size'] = trades_df['size'] / avg_size
57        
58        # 2. Aggressiveness (1 = aggressive buy, -1 = aggressive sell)
59        features['aggressiveness'] = trades_df['aggressor_side'].map({
60            'buy': 1,
61            'sell': -1,
62            'unknown': 0
63        })
64        
65        # 3. Order flow imbalance
66        window_trades = trades_df['size'] * features['aggressiveness']
67        buy_volume = window_trades.clip(lower=0).rolling(window=50).sum()
68        sell_volume = (-window_trades.clip(upper=0)).rolling(window=50).sum()
69        total_volume = buy_volume + sell_volume
70        features['ofi'] = (buy_volume - sell_volume) / total_volume.replace(0, 1)
71        
72        # 4. Trade sign autocorrelation (informed flow is persistent)
73        features['sign_autocorr'] = (
74            features['aggressiveness']
75            .rolling(window=20)
76            .apply(lambda x: x.autocorr(lag=1) if len(x) > 1 else 0)
77        )
78        
79        # 5. Immediate price impact
80        mid_price = (trades_df['bid'] + trades_df['ask']) / 2
81        price_change = mid_price.diff()
82        features['price_impact'] = (
83            price_change * features['aggressiveness']
84        ) / trades_df['size']
85        
86        # 6. Volatility
87        features['volatility'] = (
88            trades_df['price']
89            .pct_change()
90            .rolling(window=50)
91            .std()
92        )
93        
94        # 7. Spread-relative size
95        spread = trades_df['ask'] - trades_df['bid']
96        features['size_to_spread'] = trades_df['size'] / spread
97        
98        # 8. Time of day (institutional traders active certain hours)
99        features['hour'] = pd.to_datetime(trades_df['timestamp']).dt.hour
100        features['minute'] = pd.to_datetime(trades_df['timestamp']).dt.minute
101        
102        return features.fillna(0)
103    
104    def train(
105        self,
106        trades_df: pd.DataFrame,
107        labels: pd.Series
108    ):
109        """
110        Train classifier
111        
112        labels: 1 = informed, 0 = uninformed
113        """
114        features = self.extract_features(trades_df)
115        
116        # Remove NaN rows
117        mask = ~features.isna().any(axis=1)
118        features = features[mask]
119        labels = labels[mask]
120        
121        # Scale features
122        features_scaled = self.scaler.fit_transform(features)
123        
124        # Train model
125        self.model.fit(features_scaled, labels)
126        
127        # Feature importance
128        importance = pd.DataFrame({
129            'feature': features.columns,
130            'importance': self.model.feature_importances_
131        }).sort_values('importance', ascending=False)
132        
133        print("Feature Importance:")
134        print(importance)
135        
136    def predict_informed_probability(
137        self,
138        trades_df: pd.DataFrame
139    ) -> np.ndarray:
140        """Predict probability of informed flow"""
141        features = self.extract_features(trades_df)
142        features_scaled = self.scaler.transform(features.fillna(0))
143        
144        # Probability of class 1 (informed)
145        return self.model.predict_proba(features_scaled)[:, 1]
146
147class VPINCalculator:
148    """
149    Volume-Synchronized Probability of Informed Trading (VPIN)
150    
151    Measures order flow toxicity / information asymmetry
152    
153    High VPIN = high informed trading = risky to provide liquidity
154    """
155    
156    def __init__(self, bucket_size: int = 10000):
157        """
158        bucket_size: volume per bucket (e.g., 10k shares)
159        """
160        self.bucket_size = bucket_size
161        
162    def calculate_vpin(
163        self,
164        trades_df: pd.DataFrame,
165        num_buckets: int = 50
166    ) -> pd.DataFrame:
167        """
168        Calculate VPIN metric
169        
170        Algorithm:
171        1. Partition trades into equal-volume buckets
172        2. For each bucket, calculate |buy_volume - sell_volume|
173        3. VPIN = avg(|imbalance|) / avg(total_volume)
174        """
175        # Classify trades as buy or sell
176        trades_df = trades_df.copy()
177        trades_df['signed_volume'] = trades_df['size'] * trades_df['aggressor_side'].map({
178            'buy': 1,
179            'sell': -1,
180            'unknown': 0
181        })
182        
183        # Create volume buckets
184        trades_df['cumulative_volume'] = trades_df['size'].cumsum()
185        trades_df['bucket'] = (
186            trades_df['cumulative_volume'] // self.bucket_size
187        ).astype(int)
188        
189        # Calculate bucket statistics
190        bucket_stats = trades_df.groupby('bucket').agg({
191            'signed_volume': 'sum',
192            'size': 'sum',
193            'timestamp': 'last'
194        }).reset_index()
195        
196        bucket_stats['buy_volume'] = bucket_stats['signed_volume'].clip(lower=0)
197        bucket_stats['sell_volume'] = (-bucket_stats['signed_volume']).clip(lower=0)
198        bucket_stats['imbalance'] = (
199            bucket_stats['buy_volume'] - bucket_stats['sell_volume']
200        ).abs()
201        
202        # Calculate VPIN over rolling window
203        bucket_stats['vpin'] = (
204            bucket_stats['imbalance']
205            .rolling(window=num_buckets)
206            .sum()
207        ) / (
208            bucket_stats['size']
209            .rolling(window=num_buckets)
210            .sum()
211        )
212        
213        return bucket_stats[['timestamp', 'vpin', 'imbalance', 'size']]
214
215# Example: Generate labels for training
216def create_informed_labels(trades_df: pd.DataFrame) -> pd.Series:
217    """
218    Create training labels based on future price impact
219    
220    Informed trade = predicts future price movement
221    """
222    # Calculate future return (next 5 minutes)
223    trades_df = trades_df.copy()
224    trades_df['timestamp'] = pd.to_datetime(trades_df['timestamp'])
225    trades_df = trades_df.set_index('timestamp')
226    
227    # Future mid price
228    mid_price = (trades_df['bid'] + trades_df['ask']) / 2
229    future_return = mid_price.shift(-300) / mid_price - 1  # 5min forward
230    
231    # Trade direction
232    direction = trades_df['aggressor_side'].map({'buy': 1, 'sell': -1, 'unknown': 0})
233    
234    # Directionally correct if return matches trade direction
235    impact = future_return * direction
236    
237    # Label as informed if impact > threshold (e.g., 10bps)
238    threshold = 0.0010  # 10 bps
239    labels = (impact.abs() > threshold).astype(int)
240    
241    return labels
242

Price Impact Models #

Predict market impact for optimal execution.

python

1# price_impact.py
2import numpy as np
3import pandas as pd
4from scipy.optimize import minimize
5from sklearn.linear_model import LinearRegression
6
7class PriceImpactModel:
8    """
9    Price impact prediction models
10    
11    Permanent impact: long-term price change from information
12    Temporary impact: short-term price pressure that reverts
13    """
14    
15    def __init__(self):
16        self.permanent_model = None
17        self.temporary_model = None
18        
19    def almgren_chriss_impact(
20        self,
21        quantity: float,
22        participation_rate: float,
23        volatility: float,
24        adv: float,  # Average daily volume
25        eta: float = 0.1,  # Temporary impact coefficient
26        gamma: float = 0.1  # Permanent impact coefficient
27    ) -> Tuple[float, float]:
28        """
29        Almgren-Chriss impact model
30        
31        Temporary impact: η * σ * (v / V)
32        Permanent impact: γ * σ * (x / V)
33        
34        where:
35        - σ = volatility
36        - v = participation rate
37        - V = average volume
38        - x = quantity
39        """
40        # Normalize by ADV
41        quantity_normalized = quantity / adv
42        
43        # Temporary impact (bps)
44        temporary = eta * volatility * participation_rate * 10000
45        
46        # Permanent impact (bps)
47        permanent = gamma * volatility * quantity_normalized * 10000
48        
49        return permanent, temporary
50    
51    def fit_power_law_impact(
52        self,
53        executions_df: pd.DataFrame
54    ):
55        """
56        Fit power law impact model: impact = a * (quantity / adv)^b
57        
58        Typical: b ≈ 0.5 to 0.7
59        """
60        # Prepare data
61        X = np.log(executions_df['quantity'] / executions_df['adv'])
62        y = np.log(executions_df['impact'].abs())
63        
64        # Fit log-linear model
65        model = LinearRegression()
66        model.fit(X.values.reshape(-1, 1), y)
67        
68        a = np.exp(model.intercept_)
69        b = model.coef_[0]
70        
71        print(f"Power law fit: impact = {a:.4f} * (quantity/adv)^{b:.4f}")
72        
73        self.permanent_model = (a, b)
74        
75        return a, b
76    
77    def predict_impact(
78        self,
79        quantity: float,
80        adv: float,
81        model: str = 'power_law'
82    ) -> float:
83        """Predict price impact in bps"""
84        if model == 'power_law' and self.permanent_model:
85            a, b = self.permanent_model
86            return a * (quantity / adv) ** b
87        else:
88            # Fallback: square root model
89            return 10 * np.sqrt(quantity / adv) * 10000  # bps
90    
91    def optimal_execution_schedule(
92        self,
93        total_quantity: int,
94        time_horizon_minutes: int,
95        adv: float,
96        volatility: float,
97        risk_aversion: float = 0.01
98    ) -> pd.DataFrame:
99        """
100        Almgren-Chriss optimal execution
101        
102        Minimizes: E[cost] + λ * Var[cost]
103        
104        where cost = permanent impact + temporary impact + volatility risk
105        """
106        # Discretize time
107        num_intervals = time_horizon_minutes
108        dt = 1.0 / num_intervals  # Fraction of trading period
109        
110        # Almgren-Chriss parameters
111        sigma = volatility
112        eta = 0.1  # Temporary impact
113        gamma = 0.1  # Permanent impact
114        lamb = risk_aversion
115        
116        # Optimal trajectory (closed form solution)
117        kappa = np.sqrt(lamb * sigma**2 / eta)
118        tau = time_horizon_minutes / (252 * 390)  # Trading period in years
119        
120        # Trading schedule
121        schedule = []
122        remaining = total_quantity
123        
124        for n in range(num_intervals):
125            t = n * dt
126            
127            # Optimal trading rate (Almgren-Chriss formula)
128            sinh_term = np.sinh(kappa * tau * (1 - t))
129            sinh_total = np.sinh(kappa * tau)
130            
131            trade_quantity = remaining * (1 - sinh_term / sinh_total)
132            
133            schedule.append({
134                'interval': n,
135                'time_min': n,
136                'quantity': int(trade_quantity),
137                'remaining': int(remaining - trade_quantity)
138            })
139            
140            remaining -= trade_quantity
141        
142        return pd.DataFrame(schedule)
143
144# Production impact model
145class RealtimeImpactPredictor:
146    """Real-time impact prediction using live order book"""
147    
148    def __init__(self):
149        self.book_snapshot = None
150        
151    def update_book(self, bids: list, asks: list):
152        """Update order book snapshot"""
153        self.book_snapshot = {
154            'bids': sorted(bids, key=lambda x: -x[0]),  # Descending price
155            'asks': sorted(asks, key=lambda x: x[0])     # Ascending price
156        }
157    
158    def estimate_execution_price(
159        self,
160        quantity: int,
161        side: str  # 'buy' or 'sell'
162    ) -> Tuple[float, float]:
163        """
164        Estimate execution price by walking the book
165        
166        Returns: (avg_price, total_impact_bps)
167        """
168        if not self.book_snapshot:
169            return None, None
170        
171        levels = self.book_snapshot['asks'] if side == 'buy' else self.book_snapshot['bids']
172        
173        remaining = quantity
174        total_cost = 0.0
175        initial_price = levels[0][0] if levels else 0
176        
177        for price, size in levels:
178            if remaining <= 0:
179                break
180            
181            executed = min(remaining, size)
182            total_cost += executed * price
183            remaining -= executed
184        
185        if quantity > 0 and total_cost > 0:
186            avg_price = total_cost / (quantity - remaining)
187            impact_bps = ((avg_price - initial_price) / initial_price) * 10000
188            return avg_price, impact_bps
189        
190        return None, None
191

Production Metrics #

Our microstructure system (2024):

Queue Position Accuracy #

plaintext

1Queue Position Estimation:
2- Average error: 147 shares (12% of order size)
3- Fill probability accuracy: 84%
4- Orders saved from adverse fills: 2,847/month
5- P&L improvement: $124k/month
6
7Position-aware Strategy:
8- Join queue when ahead < 30%
9- Cancel and cross when ahead > 70%
10- Spread capture improvement: 42% → 58%
11

Flow Classification #

plaintext

1Informed Flow Detection:
2- Classification accuracy: 78%
3- Precision (informed): 82%
4- Recall (informed): 71%
5- F1 score: 0.76
6
7VPIN Monitoring:
8- High VPIN threshold: 0.35
9- Liquidity provision stopped: 234 times/day
10- Avoided adverse selection: $89k/day
11- False positives: 12%
12

Impact Prediction #

plaintext

1Price Impact Model:
2- R² score: 0.76
3- RMSE: 2.4 bps
4- Power law exponent: 0.58
5- Prediction horizon: 5 minutes
6
7Execution Cost:
8- Before impact model: 8.7 bps average
9- After optimization: 5.7 bps average
10- Annual savings: $3.2M
11- Slippage reduction: 34%
12

Lessons Learned #

After 2 years analyzing microstructure:

Queue position matters: Being ahead in queue worth 2-3 bps
Informed flow predictable: 78% accuracy with ML
VPIN works: High VPIN predicts adverse selection
Impact is square root: Power law exponent ~0.5-0.7
Book depth misleading: Hidden orders dominate
Time priority critical: Microseconds matter for queue position
Trade size reveals: Large orders are informed
Spread is noisy: True cost includes impact
Patience pays: Passive orders capture spread
Monitor always: Microstructure changes intraday

Understanding order flow is alpha.

Market Microstructure: Order Flow and Price Discovery

Why Market Microstructure Matters #

Queue Position Estimation #

Informed vs Uninformed Flow #

Price Impact Models #

Production Metrics #

Queue Position Accuracy #

Flow Classification #

Impact Prediction #

Lessons Learned #

Further Reading #

NordVarg Team

Join 1,000+ Engineers

Related Posts

Market Microstructure: Order Flow and Price Discovery

Why Market Microstructure Matters #

Queue Position Estimation #

Informed vs Uninformed Flow #

Price Impact Models #

Production Metrics #

Queue Position Accuracy #

Flow Classification #

Impact Prediction #

Lessons Learned #

Further Reading #

NordVarg Team

Join 1,000+ Engineers

Related Posts