Building Ultra-Low Latency Systems: The $10M Microsecond
How we reduced trading system latency from 500μs to 50μs—and why every microsecond matters in high-frequency trading
In July 2019, a high-frequency trading firm lost $10 million in a single day because their system was 100 microseconds slower than a competitor. Not 100 milliseconds—100 microseconds. One ten-thousandth of a second. The time it takes light to travel 30 kilometers.
The firm had been profitable for years, running a market-making strategy on US equities. Their system processed market data, calculated fair value, and sent orders in about 200 microseconds end-to-end. Respectable performance by most standards. But a new competitor entered the market with a 100-microsecond system, and suddenly, the firm's orders were always second in the queue. In HFT, second place means you don't trade. You watch profits evaporate.
This isn't a hypothetical story. It's the reality of high-frequency trading, where microseconds translate directly to millions of dollars. Over the past decade, we've built dozens of ultra-low latency systems for trading firms, exchanges, and market data providers. We've seen systems where a 10-microsecond improvement generated $50M in additional annual revenue. We've also seen firms spend millions optimizing systems that didn't need optimization, chasing latency improvements that didn't matter for their strategy.
This article covers what we've learned: how to build systems that operate at microsecond scale, when it's worth the effort, and—critically—when it's not. We'll discuss the complete stack: memory management, network optimization, CPU architecture awareness, and lock-free algorithms. More importantly, we'll discuss the trade-offs, because ultra-low latency comes at a cost.
Before diving into techniques, let's understand why firms spend millions optimizing for microseconds.
In most electronic markets, orders are matched on a price-time priority basis: best price wins, and among equal prices, first-in-time wins. If you're trying to buy at $100.00 and someone else is also trying to buy at $100.00, whoever got there first gets filled when a seller arrives.
This creates a "winner-take-all" dynamic. Being 10 microseconds faster than competitors doesn't give you 10% more trades—it gives you 100% of the trades at that price level until your order is filled. The latency advantage compounds: you see market data first, calculate fair value first, and send orders first. You're always at the front of the queue.
For a market-making firm trading 10 million shares daily with 1 cent average edge per share, a full fill on every attempt would be $100,000 daily profit, or $25M annually. If a 100-microsecond latency improvement increases your fill rate from 40% to 60%, annual profit rises from $10M to $15M, an extra $5M per year. Suddenly, spending $2M on FPGA development and $500K annually on co-location fees makes perfect sense.
Latency also determines whether you get adversely selected—trading when you shouldn't. If your system is slow, you'll still have stale quotes in the market when prices move. Fast traders will pick off your stale quotes, and you'll lose money on every trade.
Consider a market maker quoting $100.00 bid, $100.02 ask. News breaks that should move the stock to $100.50. A fast trader sees the news in 50 microseconds and hits your $100.02 offer. You don't cancel your quote until 200 microseconds later. You just sold at $100.02 what's now worth $100.50—a 48-cent loss per share.
This happens thousands of times per day. The cumulative adverse selection can turn a profitable strategy into a money-losing one. Latency isn't just about winning trades; it's about avoiding losing trades.
Not every trading strategy needs microsecond latency. If you're running a daily rebalancing strategy, shaving 100 microseconds off execution time is irrelevant. If you're trading illiquid stocks where the bid-ask spread is 10 cents, microsecond optimization won't help—your edge comes from information, not speed.
We've seen firms waste millions optimizing latency for strategies that didn't benefit. A systematic equity fund spent $5M building a low-latency execution system for strategies that held positions for days. The latency improvement had zero impact on returns because their edge was in alpha generation, not execution speed.
The rule: optimize latency only if your strategy's profitability depends on queue position or avoiding adverse selection. Otherwise, focus on reliability, capacity, and cost.
To optimize latency, you need to understand where time is spent. Here's a typical breakdown for a market-making system:
| Component | Latency | Percentage | Optimization Potential |
|---|---|---|---|
| Network (exchange → server) | 50-100μs | 25-40% | High (co-location, kernel bypass) |
| NIC processing | 5-15μs | 5-10% | Medium (DPDK, RDMA) |
| Market data parsing | 10-30μs | 10-15% | High (zero-copy, SIMD) |
| Strategy logic | 20-50μs | 15-25% | High (algorithmic, caching) |
| Risk checks | 10-30μs | 10-15% | Medium (lock-free, pre-validation) |
| Order generation | 5-15μs | 5-10% | Low (template metaprogramming) |
| Network (server → exchange) | 50-100μs | 25-40% | High (kernel bypass, batching) |
| Total | 150-340μs | 100% | |
The network dominates. Even with perfect code, physics limits how fast packets travel. This is why co-location (placing servers physically next to the exchange) is critical—it reduces network latency from milliseconds to microseconds.
But assuming you're co-located, the remaining components offer significant optimization opportunities. Let's dive into each.
Standard memory allocators (malloc, new) are designed for general-purpose use: they're thread-safe, handle arbitrary sizes, and prevent fragmentation. This generality comes at a cost—allocations can take hundreds of nanoseconds and involve locks that cause contention.
In a latency-sensitive system, dynamic allocation is poison. Every new or malloc is a potential latency spike. The solution: custom memory pools that pre-allocate memory and hand it out without locks.
Here's a production-grade lock-free memory pool we use:
```cpp
#include <atomic>
#include <array>
#include <cstddef>
#include <cstdint>

template<typename T, size_t PoolSize>
class LockFreePool {
    // Cache line alignment prevents false sharing between the
    // allocation index and the pool storage
    alignas(64) std::atomic<size_t> head{0};
    alignas(64) std::array<T, PoolSize> pool;  // all elements constructed up front

public:
    T* allocate() noexcept {
        // Atomic increment, no locks
        size_t index = head.fetch_add(1, std::memory_order_relaxed);
        return &pool[index % PoolSize];
    }

    // No deallocation - the pool is circular, and objects are
    // reused when the index wraps around
};

// Usage for order objects
struct Order {
    uint64_t order_id;
    uint32_t quantity;
    double price;
    // ... other fields
};

LockFreePool<Order, 10000> order_pool;

// Allocate an order - takes ~5 nanoseconds
Order* order = order_pool.allocate();
```

Why this works:

- `allocate` is a single atomic `fetch_add` instruction: no locks, no syscalls, no size-class bookkeeping
- The `std::array` storage is constructed once at startup, so the hot path never touches the heap

Trade-offs:

- No per-object deallocation: once you allocate more than `PoolSize` objects, the index wraps around and objects are reused, so the pool must be sized for the maximum number of live objects

For trading systems, these trade-offs are acceptable. We know the maximum number of concurrent orders, so we size the pool accordingly. The 100x latency improvement (500ns → 5ns) is worth the memory cost.
We once optimized a risk engine with custom allocators, reducing allocation latency from 200ns to 10ns. Performance improved by 15%—in benchmarks. In production, it got worse.
The problem: our allocator wasn't thread-safe, and we'd introduced a subtle race condition. Under load, threads would occasionally get the same memory address, corrupting data. The bug only appeared at high message rates, making it nearly impossible to reproduce in testing.
The lesson: custom allocators are powerful but dangerous. Use them only when profiling shows allocation is a bottleneck, and test exhaustively under production load.
The Linux network stack is designed for general-purpose networking: it's secure, reliable, and handles thousands of concurrent connections. It's also slow. Every packet goes through multiple layers: NIC → driver → kernel → socket buffer → user space. Each layer adds latency and involves context switches.
For ultra-low latency, we bypass the kernel entirely using DPDK (Data Plane Development Kit) or similar frameworks.
DPDK gives user-space applications direct access to the network card, eliminating kernel overhead:
Traditional stack: NIC → kernel → user space (10-20μs)

DPDK: NIC → user space (1-3μs)
The latency improvement is dramatic, but DPDK requires rethinking your entire network architecture:
Benefits:

- Kernel bypass cuts NIC-to-application latency from 10-20μs to 1-3μs
- Poll-mode drivers replace interrupts, eliminating context switches and the jitter they cause

Costs:

- CPU cores must be dedicated to busy-polling the NIC, even when no traffic arrives
- The entire network layer has to be rewritten against DPDK's APIs; standard socket tooling and debugging no longer apply
We use DPDK for market data feeds and order entry, where latency is critical. For everything else (monitoring, logging, admin), we use standard sockets. The complexity isn't worth it unless you're optimizing the critical path.
Modern CPUs are incredibly fast—a 3GHz CPU executes 3 billion instructions per second. But memory is slow. Accessing main RAM takes 100-200 nanoseconds, which is 300-600 CPU cycles. If your code constantly misses the cache, the CPU spends most of its time waiting for memory.
CPUs load memory in 64-byte chunks called cache lines. If two threads access different variables in the same cache line, they create "false sharing"—the cache line bounces between cores, destroying performance.
Bad (false sharing):

```cpp
struct Counters {
    std::atomic<uint64_t> thread1_counter;  // Bytes 0-7
    std::atomic<uint64_t> thread2_counter;  // Bytes 8-15
    // Both in the same cache line!
};
```

Good (cache line aligned):

```cpp
struct Counters {
    alignas(64) std::atomic<uint64_t> thread1_counter;  // Cache line 0
    alignas(64) std::atomic<uint64_t> thread2_counter;  // Cache line 1
    // Different cache lines, no false sharing
};
```

We once spent two weeks debugging a mysterious performance degradation in a market data parser. Throughput would randomly drop by 30%, then recover. The cause: two threads were updating counters in the same cache line, creating false sharing. Adding `alignas(64)` fixed it instantly.
Modern CPUs can prefetch data into cache before you access it, hiding memory latency. But the prefetcher is a simple pattern detector—it works well for sequential access, poorly for random access.
For random access, manual prefetching helps:
```cpp
for (size_t i = 0; i < orders.size(); ++i) {
    // Prefetch the next order while processing the current one
    if (i + 1 < orders.size()) {
        __builtin_prefetch(&orders[i + 1], 0, 3);
    }

    process_order(orders[i]);
}
```

This reduces average latency by 20-30% for pointer-chasing workloads. The prefetch happens in parallel with processing, so by the time you need `orders[i+1]`, it's already in cache.
In 2021, we worked with a market-making firm whose system had 500μs order-to-market latency. They were losing trades to faster competitors and wanted to get below 100μs. Here's how we did it.
We instrumented every component with high-resolution timestamps (using `rdtsc` for nanosecond precision). Profiling pointed to four hot spots, which we attacked in turn: market data parsing, strategy logic, risk checks, and the network path.
The low-hanging fruit: market data parsing was doing string-to-number conversions for every field, even fields we didn't use. We switched to a zero-copy parser that only parsed required fields, reducing parsing time to 30μs.
The strategy was recalculating fair value from scratch on every update, even when only one input changed. We implemented incremental updates: when a single price changed, we updated only the affected calculation. This reduced strategy latency from 100μs to 25μs.
Risk checks were using a mutex to protect position limits. Under contention, threads would wait 50-100μs for the lock. We replaced it with atomic operations and lock-free algorithms, reducing risk check latency to 15μs.
The biggest win: switching from kernel sockets to DPDK. Network latency dropped from 120μs to 15μs. This required rewriting the entire network layer and dedicating two CPU cores to polling, but the latency improvement was worth it.
The firm's fill rate increased from 35% to 65%, translating to $18M additional annual revenue. The project cost $800K (6 months, 4 engineers), delivering a 22x ROI in the first year.
Not every latency optimization pays off. We've seen firms waste millions chasing microseconds that didn't matter.
A proprietary trading firm spent $3M building an FPGA-based trading system for small-cap stocks. The system achieved 5μs latency—incredible performance. But small-cap stocks trade infrequently, with wide bid-ask spreads. The latency advantage was irrelevant because there was no competition for queue position.
The firm would have been better off spending that $3M on research to find better trading signals. Latency optimization only matters when you're competing with other fast traders.
Ultra-low latency systems typically sacrifice capacity. DPDK dedicates CPU cores to polling, reducing the cores available for strategy logic. Custom allocators limit the number of concurrent objects. Lock-free algorithms often have lower throughput than locked versions.
If your strategy needs to process 10 million messages per second, optimizing for 50μs latency might reduce capacity to 5 million messages per second. You've made the system faster but less capable.
The right approach: optimize latency for the critical path (market data → order), but use standard techniques for everything else (logging, monitoring, analytics).
You can't optimize what you don't measure. Instrument every component with high-resolution timestamps. Log percentiles (50th, 95th, 99th, 99.9th), not just averages. Averages hide tail latency, and tail latency is what kills you in production.
We use rdtsc (read time-stamp counter) for nanosecond-precision timing:
```cpp
#include <cstdint>

inline uint64_t rdtsc() {
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

uint64_t start = rdtsc();
process_order(order);
uint64_t end = rdtsc();
uint64_t cycles = end - start;
// Convert cycles to nanoseconds (assuming a 3GHz CPU)
uint64_t ns = cycles / 3;
```

We once optimized a logging system to use lock-free queues, reducing logging latency from 5μs to 500ns. Impressive—but logging wasn't on the critical path. The optimization added complexity without improving end-to-end latency.
Focus on the critical path: market data → strategy → order. Everything else can use standard techniques.
Benchmarks lie. A system that achieves 50μs latency in testing might hit 500μs in production due to cache contention, NUMA effects, or interrupt storms.
We always test under realistic load: production message rates, production data patterns, production hardware configuration. And we test for hours, not minutes—some issues only appear after the system has been running long enough for caches to warm up and memory to fragment.
Building ultra-low latency systems is part science, part art. The science is understanding CPU architecture, memory hierarchies, and network protocols. The art is knowing when to optimize and when to stop.
Every microsecond of latency improvement comes at a cost: complexity, maintainability, capacity, or money. The key is understanding your strategy's requirements and optimizing accordingly. If you're competing for queue position in liquid markets, microseconds matter. If you're trading illiquid stocks or holding positions for hours, they don't.
The firms that succeed are those that optimize strategically: ultra-low latency for the critical path, standard techniques for everything else. They measure obsessively, test realistically, and know when to stop optimizing.
Because in the end, the goal isn't to build the fastest system—it's to build the most profitable one.
The NordVarg Team builds high-performance financial systems, specializing in ultra-low latency infrastructure and type-safe programming.