Smart Order Routers (SOR): Design and Strategy

Smart Order Routers (SORs) are a core execution component for multi-venue trading. They decide where and how to route orders to achieve best execution given constraints like latency, fees, fill probability, and regulatory requirements. This article explains SOR responsibilities, common routing algorithms, system architecture, testing strategies, and practical trade-offs you'll face when building one for production.

Who this is for #

Engineers building execution systems, quant developers designing execution strategies, and platform teams that operate low-latency routing infrastructure.

Prerequisites: basic market structure (order book, matching engines), networking and concurrency familiarity, and an understanding of order types.

Goals of an SOR #

A good SOR delivers:

- Best execution (by configured policy): minimize cost, minimize latency, maximize fill probability, or some weighted combination of the three.
- Robustness: safe fallbacks, retries, and idempotency to avoid duplicate fills.
- Observability: metrics and traces that explain routing decisions and their outcomes.
- Compliance: enforce regulatory constraints (e.g., MiFID II best execution obligations) and internal risk limits.

High-level architecture #

A typical SOR is split into these components:

- Market Data Feed Adapter(s): ingest L1/L2 prices and venue-specific order-type capabilities.
- Venue Profiles & Gateways: per-venue latency, fees, minimum quantities, supported order types, and special behaviors.
- Routing Engine: the core decision logic. It takes an input order and produces a per-venue plan (split, sizes, order types, and timing).
- Execution Orchestrator: submits orders, monitors acknowledgements and fills, and implements retry/fallback logic.
- State & Audit Store: durable record of orders, attempts, ExecIDs, ClOrdIDs, and reconciled outcomes.
- Telemetry & Feedback Loop: metrics for decision quality, slippage, fill rates, and latency.

Mermaid sketch (simplified):

```mermaid
flowchart LR
  OrderIn[Client Order]
  OrderIn --> Router[Routing Engine]
  Router -->|Plan| Orch[Execution Orchestrator]
  Orch --> GW1[Venue Gateway A]
  Orch --> GW2[Venue Gateway B]
  GW1 --> VenueA[Exchange A]
  GW2 --> VenueB[Exchange B]
  MarketData[Market Data Feed] --> Router
  MarketData --> Orch
```

Routing algorithms and heuristics #

Routing algorithms differ by complexity and goals. Here are common patterns:

Top-of-book (price-first)
- Route to the venue with the best visible price. Simple and fast.
- Limitations: doesn't account for hidden liquidity, fees, or historical fill probability.
Cost-based routing
- Score venues by (price + fees + estimated market impact). Route where the net expected cost is lowest.
- Requires fee models and impact models; it costs more computation per decision but usually lowers all-in execution cost once the models are calibrated.
Probabilistic routing / statistical models
- Use historical fill probabilities and execution likelihood to split or prioritize venues.
- Example: if venue A has historically filled 60% of orders of that size and venue B 40%, a simple proportional rule routes 60% of the quantity to A and 40% to B.
Time-sliced routing
- For large orders, slice across time (TWAP/VWAP style) and use the SOR to select venue per slice.
Smart peeking / IOC probing
- Send small immediate-or-cancel (IOC) probe orders to detect hidden liquidity, then route the remainder based on what filled.
- Check venue rules before probing, and remember that probes can signal your intent to other participants.
Hybrid: multi-criteria optimization
- Combine price, fees, latency, and fill-probability into a unified score. Often implemented as a configurable utility function.
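The probabilistic pattern above can be sketched as a proportional split. This is a minimal illustration rather than a production allocator: `split_by_fill_probability`, the venue names, and the lot size are hypothetical, and the fill probabilities are assumed to come from your own historical statistics.

```python
def split_by_fill_probability(total_qty, fill_probs, lot_size=1):
    """Split total_qty across venues in proportion to historical fill probability.

    fill_probs maps venue name -> probability in [0, 1]. Quantities are rounded
    down to whole lots; lots lost to rounding go to the most likely venues.
    Any odd-lot remainder of total_qty is left unallocated.
    """
    weight_sum = sum(fill_probs.values())
    if total_qty <= 0 or weight_sum <= 0:
        return {}
    lots = total_qty // lot_size
    # Best venues first, so they receive the lots left over after rounding down.
    ranked = sorted(fill_probs, key=fill_probs.get, reverse=True)
    alloc = {v: int(lots * fill_probs[v] / weight_sum) for v in ranked}
    leftover = lots - sum(alloc.values())
    for i in range(leftover):
        alloc[ranked[i % len(ranked)]] += 1
    return {v: n * lot_size for v, n in alloc.items() if n > 0}


split = split_by_fill_probability(1000, {"A": 0.6, "B": 0.4}, lot_size=100)
print(split)
```

With the 60/40 fill rates from the example, 1,000 shares in lots of 100 split into 600 for venue A and 400 for venue B. A real router would also cap each allocation at the venue's available liquidity.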

Example: Simple cost-based scoring function #

Pseudocode for venue scoring:

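```python
LATENCY_WEIGHT = 0.0001  # illustrative value: cost charged per millisecond of venue latency
MIN_FILL_PROB = 1e-6     # floor that prevents division by zero for venues that never fill

def score_venue(venue, price, size, fees_model, impact_model):
    """Expected per-share cost of buying `size` shares at `venue`; lower is better."""
    fee = fees_model.estimate(venue, size)
    impact = impact_model.estimate(venue, size)
    latency_penalty = venue.latency_ms * LATENCY_WEIGHT
    # Dividing by the fill probability inflates the cost of venues unlikely to fill.
    expected_fill_prob = max(venue.fill_probability(size), MIN_FILL_PROB)
    expected_cost = (price + fee + impact) / expected_fill_prob + latency_penalty
    return expected_cost
```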

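Sort venues by expected_cost in ascending order and route to the cheapest first, moving down the list until the order is filled or a constraint (such as a venue's available size or a risk limit) stops you.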
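The greedy walk over scored venues might look like the sketch below. `Quote` and `allocate_greedy` are hypothetical names, the expected costs are assumed to have been computed already by a scoring function, and the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    venue: str
    expected_cost: float   # output of a scoring function; lower is better
    available_qty: int     # size the venue is expected to fill at that cost
    min_qty: int = 1       # venue minimum order quantity


def allocate_greedy(order_qty, quotes):
    """Walk venues from cheapest to most expensive, taking what each can fill."""
    plan, remaining = [], order_qty
    for q in sorted(quotes, key=lambda q: q.expected_cost):
        if remaining <= 0:
            break
        take = min(remaining, q.available_qty)
        if take < q.min_qty:
            continue   # too small for this venue; try the next one
        plan.append((q.venue, take))
        remaining -= take
    return plan, remaining


quotes = [
    Quote("A", expected_cost=100.03, available_qty=300),
    Quote("B", expected_cost=100.01, available_qty=200),
    Quote("C", expected_cost=100.08, available_qty=1000, min_qty=100),
]
plan, unfilled = allocate_greedy(600, quotes)
print(plan, unfilled)   # [('B', 200), ('A', 300), ('C', 100)] 0
```

Venue B is cheapest and is exhausted first, then A, and C absorbs the last 100 shares. Any quantity returned as unfilled is left for the caller to rest, re-plan, or cancel according to policy.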

Order slicing and atomicity #

Large parent orders are usually sliced into child orders. The SOR must:

- Respect minimum quantity/lot sizes per venue.
- Track partial fills and route remaining quantity dynamically.
- Ensure idempotency and correlation between ClOrdID and ExecIDs across retries.

Use a parent-child model: store the parent order's metadata and give each child a pointer to its parent (parent_id). When a child fills, update the parent's state and re-run routing for the remaining quantity.
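One way to sketch the parent-child model is shown below. The class and method names (`ParentOrder`, `ChildOrder`, `on_fill`, `on_closed`, `unallocated`) are hypothetical, and the sketch assumes that ExecIDs are unique per child order so they can be used to ignore replayed fill reports.

```python
from dataclasses import dataclass, field


@dataclass
class ChildOrder:
    cl_ord_id: str
    parent_id: str           # pointer back to the parent order
    venue: str
    qty: int
    filled: int = 0
    is_open: bool = True     # False once fully filled, canceled or rejected
    seen_exec_ids: set = field(default_factory=set)


@dataclass
class ParentOrder:
    parent_id: str
    qty: int
    children: dict = field(default_factory=dict)   # cl_ord_id -> ChildOrder

    def add_child(self, cl_ord_id, venue, qty):
        child = ChildOrder(cl_ord_id, self.parent_id, venue, qty)
        self.children[cl_ord_id] = child
        return child

    def filled(self):
        return sum(c.filled for c in self.children.values())

    def unallocated(self):
        """Quantity that is neither filled nor still working on a venue."""
        working = sum(c.qty - c.filled for c in self.children.values() if c.is_open)
        return self.qty - self.filled() - working

    def on_fill(self, cl_ord_id, exec_id, fill_qty):
        """Apply a fill once; return False for a duplicate ExecID."""
        child = self.children[cl_ord_id]
        if exec_id in child.seen_exec_ids:
            return False
        child.seen_exec_ids.add(exec_id)
        child.filled += fill_qty
        if child.filled >= child.qty:
            child.is_open = False
        return True

    def on_closed(self, cl_ord_id):
        """Mark a child as canceled, expired or rejected."""
        self.children[cl_ord_id].is_open = False


parent = ParentOrder("P1", qty=1000)
parent.add_child("P1-1", "A", 600)
parent.add_child("P1-2", "B", 400)
parent.on_fill("P1-1", "E1", 250)
parent.on_fill("P1-1", "E1", 250)   # replayed report is ignored
parent.on_closed("P1-2")            # venue B canceled its child unfilled
print(parent.filled(), parent.unallocated())   # 250 400
```

Only the unallocated quantity (400 here) should go back to the routing engine; the 350 shares still working on venue A are already committed.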

Execution orchestration & reliability #

Orchestrator responsibilities:

- Submit child orders with unique ClOrdIDs, persisting each one before it is sent.
- Monitor acknowledgements (ACK/NAK) and ExecutionReports for fills.
- Implement timeout-based retries, backoff, and venue-specific fallbacks.
- Cancel outstanding child orders if the parent is canceled.

Practical rules:

- Write child orders to a durable journal before sending, so that a crash between persist and send can be recovered by replaying the journal.
- Cap the number of retry attempts per venue.
- In ambiguous states (e.g., a gateway goes offline after an order was sent but before it was acknowledged), the order may still be live on the venue. Prefer conservative behavior (stop, or route to secondary venues only the quantity known not to be working) rather than blindly retrying, which risks a duplicate fill.
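The rules above can be sketched as a small submit helper. Everything here is illustrative: `submit_child` is a hypothetical name, the journal is any object with an `append` method (a list in this example, a durable store in production), and the mapping of `ConnectionError` to "definitely not sent" and `TimeoutError` to "state unknown" is an assumption you would need to verify against your own gateway.

```python
import time


def submit_child(child, send, journal, max_attempts=3, base_delay_s=0.05,
                 sleep=time.sleep):
    """Journal the child, then send it with capped, backed-off retries.

    Returns "sent", "failed" (every attempt definitely failed) or "unknown"
    (a timeout: the order may be live, so reconcile before doing anything else).
    """
    cl_ord_id = child["cl_ord_id"]
    journal.append({"event": "new", "cl_ord_id": cl_ord_id})   # persist before sending
    for attempt in range(1, max_attempts + 1):
        try:
            send(child)   # the same ClOrdID is reused on every attempt
        except ConnectionError:
            # Assumed to mean the message never left this process: safe to retry.
            journal.append({"event": "send_failed", "cl_ord_id": cl_ord_id,
                            "attempt": attempt})
            if attempt < max_attempts:
                sleep(base_delay_s * 2 ** (attempt - 1))   # exponential backoff
        except TimeoutError:
            # Ambiguous outcome: do not retry blindly.
            journal.append({"event": "unknown", "cl_ord_id": cl_ord_id,
                            "attempt": attempt})
            return "unknown"
        else:
            journal.append({"event": "sent", "cl_ord_id": cl_ord_id,
                            "attempt": attempt})
            return "sent"
    return "failed"


attempts = []

def flaky_send(child):
    """Test double: fails twice with a connection error, then succeeds."""
    attempts.append(child["cl_ord_id"])
    if len(attempts) < 3:
        raise ConnectionError("gateway down")

journal = []
status = submit_child({"cl_ord_id": "P1-0001", "venue": "A", "qty": 100},
                      flaky_send, journal, sleep=lambda s: None)
print(status, [e["event"] for e in journal])
```

Reusing the ClOrdID across attempts only helps if the venue rejects duplicate ClOrdIDs, which is worth confirming per venue before relying on it.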

Feedback and learning loop #

A production SOR learns from historical outcomes:

- Maintain per-venue statistics by symbol, size bucket, and time-of-day: fill rates, latencies, reject rates, and realized slippage.
- Use these statistics as priors in probabilistic routing.
- Periodically retrain or update models; keep a validation pipeline to detect regressions.
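As one possible shape for those statistics, the sketch below keeps fill and miss counts per key and smooths them with a Beta prior, so that a venue with little history is not scored as 0% or 100%. `FillStats` is a hypothetical class, and both the key layout (venue, symbol, size bucket, hour) and the prior of one fill and one miss are assumptions.

```python
class FillStats:
    """Fill-rate estimates per key, smoothed with a Beta prior."""

    def __init__(self, prior_fills=1.0, prior_misses=1.0):
        self.prior_fills = prior_fills
        self.prior_misses = prior_misses
        self.counts = {}   # key -> [fills, misses]

    def record(self, key, filled):
        """Record one outcome for a key such as (venue, symbol, size_bucket, hour)."""
        entry = self.counts.setdefault(key, [0, 0])
        entry[0 if filled else 1] += 1

    def fill_probability(self, key):
        """Posterior mean of the fill rate; falls back to the prior with no data."""
        fills, misses = self.counts.get(key, (0, 0))
        total = fills + misses + self.prior_fills + self.prior_misses
        return (fills + self.prior_fills) / total


stats = FillStats()
key = ("A", "XYZ", "1k-5k", 10)
for outcome in [True] * 8 + [False] * 2:
    stats.record(key, outcome)
print(stats.fill_probability(key))                       # (8 + 1) / (10 + 2) = 0.75
print(stats.fill_probability(("B", "XYZ", "1k-5k", 10)))  # no data: 0.5
```

A stronger prior (larger pseudo-counts) makes the estimate move more slowly, which trades responsiveness for stability.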

Metrics to monitor #

Critical SOR metrics:

- Fill rate by venue and size bucket.
- Average / tail slippage (realized vs expected price).
- Execution latency (submit -> ack, submit -> first fill).
- Cancel rate and error/reject counts per venue.
- Parent-order completion time and child-order churn (number of child orders per parent).
- Cost per executed share (fees + slippage) over time.

Instrument events with correlation ids (parent_id, child_id) and ensure traces can be reconstructed end-to-end.
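Slippage is easy to get wrong by a sign, so a small worked example may help. The sketch below assumes slippage is measured against the expected price at decision time, reported in basis points, and signed so that a positive number is always worse than expected; `slippage_bps` is a hypothetical helper.

```python
def slippage_bps(side, expected_price, fills):
    """Signed slippage in basis points; positive means worse than expected.

    fills is a list of (price, qty) tuples. Returns None if nothing filled.
    """
    total_qty = sum(qty for _, qty in fills)
    if total_qty == 0:
        return None
    vwap = sum(price * qty for price, qty in fills) / total_qty
    sign = 1 if side == "buy" else -1   # paying more hurts a buy, receiving less hurts a sell
    return sign * (vwap - expected_price) / expected_price * 10_000


# Buy expected at 100.00, filled 300 @ 100.02 and 100 @ 100.05:
# VWAP = 100.0275, so slippage is about 2.75 bps.
bps = slippage_bps("buy", 100.00, [(100.02, 300), (100.05, 100)])
print(round(bps, 2))
```

Emitting this value on the same event that carries parent_id and child_id lets you break slippage down by venue and by routing decision later.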

Testing strategies #

- Unit tests for scoring functions and deterministic routing logic.
- Simulation environment: replay historical L1/L2 data and simulate venue behavior (latency, partial fills, rejections). Validate routing outcomes against baseline metrics.
- Integration tests with sandbox FIX/REST endpoints provided by brokers.
- Chaos tests: induce network partitions, delayed acknowledgements, and gateway failover to validate orchestration and idempotency.
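For the simulation layer, even a very small venue model makes routing logic testable. The sketch below is a deliberately simplified stand-in, not a matching engine: `SimulatedVenue` is a hypothetical class that models only displayed size, IOC semantics, and a seeded random reject rate so that runs are reproducible.

```python
import random


class SimulatedVenue:
    """Toy venue for tests: fills IOC orders against a fixed displayed size."""

    def __init__(self, name, displayed_qty, reject_rate=0.0, seed=0):
        self.name = name
        self.displayed_qty = displayed_qty
        self.reject_rate = reject_rate
        self.rng = random.Random(seed)   # seeded so test runs are reproducible

    def submit_ioc(self, qty):
        """Return (status, filled_qty) for an immediate-or-cancel order."""
        if self.rng.random() < self.reject_rate:
            return ("rejected", 0)
        filled = min(qty, self.displayed_qty)
        self.displayed_qty -= filled
        if filled == qty:
            return ("filled", filled)
        if filled > 0:
            return ("partial", filled)
        return ("canceled", 0)


venue = SimulatedVenue("A", displayed_qty=150)
results = [venue.submit_ioc(100) for _ in range(3)]
print(results)   # [('filled', 100), ('partial', 50), ('canceled', 0)]
```

Latency, queue position, and hidden liquidity are left out here; they can be layered on once the basic replay harness is in place.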

Compliance and market-specific rules #

- Some venues penalize excessive IOC probing or order churn. Add rate limits and per-venue throttles.
- Respect trading hours, short-selling rules, and exchange-specific flags (e.g., at-the-close (ATC) and other auction-order semantics).
- For regulated markets, log the decision rationale for each routed child order for auditability and best-execution reporting.
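A per-venue throttle is commonly built as a token bucket. The sketch below is one minimal version under assumed limits: `TokenBucket` is a hypothetical class, and the rate and burst values are placeholders, since the real limits come from each venue's rulebook.

```python
import time


class TokenBucket:
    """Allows `rate_per_s` submissions per second with bursts up to `burst`."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Consume one token if available; never blocks."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# A fake clock keeps the example deterministic.
now = [0.0]
bucket = TokenBucket(rate_per_s=2, burst=2, clock=lambda: now[0])
first_three = [bucket.try_acquire() for _ in range(3)]
now[0] = 0.5                      # half a second later one token has refilled
after_wait = bucket.try_acquire()
print(first_three, after_wait)    # [True, True, False] True
```

Keeping one bucket per venue (and possibly per order type, such as IOC probes) lets the router skip a throttled venue instead of queueing behind it.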

Performance considerations #

- The hot path is scoring and decision-making: keep it O(N) in the number of venues, which is usually small, and use cached stats for quick decisions.
- Avoid blocking I/O on the decision path: offload network I/O to the orchestrator and use fast in-memory queues for communication.
- Keep internal representations compact; reuse buffers and avoid allocations on the hot path.

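Example: simple orchestrator loop (conceptual) #

```python
# Conceptual sketch: routing_engine, build_child, persist, send_to_gateway and
# the parent_* helpers stand in for your own components.

# Parent order arrives: turn the routing plan into child orders.
plan = routing_engine.plan(parent_order)
for step in plan:
    for venue, qty in step.items():
        child = build_child(parent_order, venue, qty)
        persist(child)                 # journal before sending so a crash cannot lose it
        send_to_gateway(venue, child)

# Monitor fills until the parent is complete.
while not parent_filled():
    msg = poll_execution_reports()
    apply(msg)                         # update child and parent state
    # parent_remaining() should count only quantity that is neither filled nor
    # still working on a venue; otherwise every report spawns duplicate children.
    if parent_remaining() > 0:
        replan_and_dispatch(parent_order)
```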

Operational playbook #

- On elevated rejects for a venue: switch to fallback venues and notify ops; throttle further submissions.
- On unexpected latency spikes: measure internal queue delays, inspect NICs, and consider short-circuiting to lower-latency venues if policy allows.
- On partial fills and slippage: run replays in a sandbox to reproduce routing outcomes and tune scoring models.
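The first playbook item can be automated with a simple breaker on the recent reject rate. This is a sketch under assumed thresholds: `RejectBreaker` and `eligible_venues` are hypothetical names, and the window, threshold, and minimum sample count are placeholders to be tuned per venue.

```python
from collections import deque


class RejectBreaker:
    """Opens when the reject rate over the last `window` responses is too high."""

    def __init__(self, window=50, threshold=0.2, min_samples=10):
        self.outcomes = deque(maxlen=window)   # True = rejected
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, rejected):
        self.outcomes.append(bool(rejected))

    def is_open(self):
        n = len(self.outcomes)
        # Require a minimum sample so a single early reject does not trip it.
        return n >= self.min_samples and sum(self.outcomes) / n > self.threshold


def eligible_venues(venues, breakers):
    """Venues whose breaker is closed, in their original order."""
    return [v for v in venues if not breakers[v].is_open()]


breakers = {"A": RejectBreaker(), "B": RejectBreaker()}
for i in range(10):
    breakers["A"].record(rejected=(i < 3))   # 3 rejects out of 10 = 30%
    breakers["B"].record(rejected=False)
print(eligible_venues(["A", "B"], breakers))  # ['B']
```

Because the window slides, a venue re-enters the eligible set on its own once enough accepted orders push the reject rate back under the threshold; pairing the breaker with an ops alert covers the "notify" half of the rule.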

Final thoughts and trade-offs #

SORs are a place where quant models meet engineering. Good SOR design provides a clean separation: let the routing engine make policy decisions based on market data and learned priors, and let the execution orchestrator handle the messy realities of ACKs, retries, and persistence.

Start simple: implement a deterministic cost-based router with robust persistence and observability. Add probabilistic splitting and learning-based routing once you have stable metrics and replayable test harnesses.



NordVarg Team
