Latency Optimization for C++ in HFT Trading — Practical Guide
A hands-on guide to profiling and optimizing latency in C++ trading code: hardware-aware design, kernel-bypass networking, lock-free queues, memory layout, and measurement best-practices.
Low latency isn't an academic trophy — it's a business lever. In HFT and market-making, microseconds (and often nanoseconds) change probability of execution, queue position and ultimately profitability. This article is a pragmatic, runnable guide for C++ engineers building low-latency trading systems. It focuses on the hot paths (market-data ingestion, order construction and send), measurement-first workflows, and techniques that have real-world payoff.
Optimizing the wrong metric wastes time. Tail latency (p99, p99.9, p99.99) is usually what matters in trading: the mean is interesting, but a low mean with fat tails still loses.
Define measurement semantics clearly: where timestamps are collected (userland vs NIC vs gateway) and how you correlate them end-to-end.
Always start with realistic load and reproducible tests.
A simple workflow:
```bash
perf record -F 997 -a -g -- ./app
```

Example flamegraph pipeline (on Linux):

```bash
perf record -F 997 -a -g -- ./market_data_consumer
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flamegraph.svg
```

For kernel & NIC interactions, bpftrace can track syscalls and interrupts without the sampling noise of perf.
Document your hot path carefully: identify which functions are on the critical path and how they interact with hardware.
Small changes at the hardware/OS layer give big wins. Typical items:
- chrt and SCHED_FIFO for critical threads

Example: pinning a thread in C++:

```cpp
#include <pthread.h>

// Pin the calling thread to a single CPU so the scheduler cannot migrate it.
void pin_thread(int cpu) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}
```

Be careful: real-time priorities and CPU isolation require operational discipline. Use monitoring and fallbacks so misconfigured services can't take down the machine.
For ultra-low-latency systems prefer deploying directly on bare metal rather than adding layers (VMs, containers) that can introduce jitter. The recommendations below assume you control the host and can apply kernel, BIOS, and service-level configuration.
Host ops checklist (commands to run as root or via sudo):
```bash
# set performance CPU governor
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  echo performance > "$cpu/cpufreq/scaling_governor" 2>/dev/null || true
done

# disable swap for deterministic memory behavior (or set swappiness low)
swapoff -a
sysctl -w vm.swappiness=1

# configure hugepages (example 1G pages, adjust to your workload)
sysctl -w vm.nr_hugepages=512
mount -t hugetlbfs nodev /dev/hugepages || true

# kernel bootline examples (edit /etc/default/grub and update-grub):
# GRUB_CMDLINE_LINUX="quiet splash isolcpus=2-5 nohz_full=2-5 rcu_nocbs=2-5 mitigations=off"
```

Systemd unit example (place in /etc/systemd/system/lowlat.service):
```ini
[Unit]
Description=Low-latency trading service
After=network.target

[Service]
Type=simple
ExecStart=/opt/lowlat/bin/trading_service --config /etc/lowlat/config.yaml
Restart=on-failure
# Give the process real-time niceness
Nice=-10
IOSchedulingClass=realtime
LimitMEMLOCK=infinity
LimitNOFILE=1048576
TasksMax=infinity
Delegate=yes

[Install]
WantedBy=multi-user.target
```

Enable and start with:
```bash
systemctl daemon-reload
systemctl enable --now lowlat.service
```

Notes:

- Delegate=yes allows systemd to hand cgroups to the process; use it if your service manages cgroups or child processes.
- LimitMEMLOCK=infinity is necessary when you use mlock/hugepages.

For multi-socket machines, memory locality matters. Common rules:
- Use numactl to control allocation, or manually bind with mbind/mmap flags

Example numactl usage for a service:

```bash
numactl --cpunodebind=0 --membind=0 ./trading_service
```
Network latency is often the dominant factor; the main lever is kernel-bypass networking (e.g. DPDK or vendor user-space stacks) combined with busy-polled receive paths.
If you do use kernel-bypass, isolate the NIC on dedicated cores and keep the code path minimal and deterministic.
Make memory friendly for cache and CPU pipelines.
Example: an AoS -> SoA transformation sketch
```cpp
#include <cstdint>
#include <vector>

struct OrderAoS { uint64_t id; double price; int qty; };
// vs
struct OrdersSoA {
    std::vector<uint64_t> id;
    std::vector<double> price;
    std::vector<int> qty;
};
```

Benchmarks determine which layout is best for your access pattern.
On the hot path, locking kills tail latency. Alternatives:
Minimal SPSC ring buffer example (single-producer, single-consumer):
```cpp
// spsc_ring.hpp — header-only minimal SPSC ring
#include <atomic>
#include <vector>

template<typename T>
class SPSCQueue {
public:
    SPSCQueue(size_t capacity) : cap_(capacity + 1), buf_(cap_) {}

    bool push(const T &v) {
        size_t n = (head_.load(std::memory_order_relaxed) + 1) % cap_;
        if (n == tail_.load(std::memory_order_acquire)) return false; // full
        buf_[head_.load(std::memory_order_relaxed)] = v;
        head_.store(n, std::memory_order_release);
        return true;
    }

    bool pop(T &out) {
        auto t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire)) return false; // empty
        out = buf_[t];
        tail_.store((t + 1) % cap_, std::memory_order_release);
        return true;
    }

private:
    const size_t cap_;
    std::vector<T> buf_;
    std::atomic<size_t> head_{0}, tail_{0};
};
```

This SPSC queue is low-overhead and well-suited for hand-off between a network parsing thread and a processing thread.
Avoid copying on the hot path; common patterns include preallocated buffers and passing indices or references instead of payloads.
For protocol encoding, minimal templates and manual packing often outperform heavy libraries.
Use -O3 -march=native -flto -fno-exceptions -fno-rtti where appropriate in hot modules.

PGO quick steps (gcc/clang):
```bash
# 1. Build with instrumentation
CXXFLAGS='-fprofile-generate -O2' make
# 2. Run realistic workload to generate .profraw files
./run_workload
# 3. Rebuild with use
CXXFLAGS='-fprofile-use -O3' make
```

Measure before/after to avoid regressing tail latency.
Microbenchmarks isolate a technique's potential. End-to-end tests measure real-world impact. Use both.
Example microbenchmark harness (simplified): it spins up a producer that writes messages into the SPSC queue and a consumer that timestamps them and records latencies. Use high-resolution clocks and export histogram data (hdrhistogram is a good choice).
A small case study: parsing/unpacking market data then pushing into a consumer queue.
Concrete before/after numbers depend heavily on workload and hardware; use the measurement recipe above to produce your own rather than trusting illustrative figures.
Document and assert assumptions in code (static assertions, config checks).
Below are minimal, reproducible artifacts you can use as a starting point. Copy these into tools/latency-harness/ in your repo to iterate.
Makefile (spsc_ring.hpp is header-only, so harness.o is the only object to link):

```makefile
CXX = g++
CXXFLAGS = -O3 -march=native -flto -fno-exceptions -std=c++20 -g
LDFLAGS = -flto

all: harness

harness: harness.o
	$(CXX) $(LDFLAGS) -o harness harness.o

harness.o: spsc_ring.hpp

%.o: %.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@
```
```bash
# run the service under load (in a separate terminal)
./harness --producer --rate=1000000
# in another terminal
perf record -F 997 -p $(pidof harness) -g -- sleep 30
perf script > out.perf
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl out.folded > flamegraph.svg
```
Below is a minimal, self-contained microbenchmark you can copy into tools/latency-harness/. It uses a small header-only SPSC ring (spsc_ring.hpp) and a harness (harness.cpp) that measures one-way latencies (producer timestamp to consumer observation) and prints basic percentiles.
Place spsc_ring.hpp and harness.cpp next to the Makefile above and build with the provided Makefile.
spsc_ring.hpp
```cpp
// spsc_ring.hpp
#pragma once
#include <atomic>
#include <cstddef>
#include <vector>

template<typename T>
class SPSCQueue {
public:
    explicit SPSCQueue(size_t capacity)
        : cap_(capacity + 1), buf_(cap_) {}

    bool push(const T &v) {
        const size_t head = head_.load(std::memory_order_relaxed);
        const size_t next = increment(head);
        if (next == tail_.load(std::memory_order_acquire)) return false; // full
        buf_[head] = v;
        head_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T &out) {
        const size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false; // empty
        out = buf_[tail];
        tail_.store(increment(tail), std::memory_order_release);
        return true;
    }

private:
    size_t increment(size_t i) const noexcept { return (i + 1) % cap_; }
    const size_t cap_;
    std::vector<T> buf_;
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};
};
```

harness.cpp
```cpp
// harness.cpp
#include "spsc_ring.hpp"
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

struct Msg {
    uint64_t seq;
    uint64_t t_ns; // timestamp in ns produced by producer
};

using Clock = std::chrono::steady_clock;

int main() {
    const size_t rounds = 1'000'000; // messages
    const size_t qcap = 1 << 16;

    SPSCQueue<Msg> q(qcap);
    std::vector<uint64_t> latencies;
    latencies.reserve(rounds);

    std::thread consumer([&](){
        Msg m;
        size_t received = 0;
        while (received < rounds) {
            if (!q.pop(m)) continue; // spin
            auto now = Clock::now();
            uint64_t now_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(now.time_since_epoch()).count();
            latencies.push_back(now_ns - m.t_ns);
            ++received;
        }
    });

    // Producer
    for (size_t i = 0; i < rounds; ++i) {
        Msg m;
        m.seq = i;
        auto now = Clock::now();
        m.t_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(now.time_since_epoch()).count();
        // spin until push succeeds
        while (!q.push(m)) {}
    }

    consumer.join();

    if (latencies.empty()) {
        std::cerr << "No samples collected\n";
        return 2;
    }

    std::sort(latencies.begin(), latencies.end());
    auto pct = [&](double p) {
        size_t idx = std::min(latencies.size() - 1, (size_t)((p / 100.0) * latencies.size()));
        return latencies[idx];
    };

    uint64_t sum = 0;
    for (auto v : latencies) sum += v;
    double mean = double(sum) / latencies.size();

    std::cout << "samples=" << latencies.size() << " mean_ns=" << (uint64_t)mean
              << " p50_ns=" << pct(50.0)
              << " p90_ns=" << pct(90.0)
              << " p99_ns=" << pct(99.0)
              << " p999_ns=" << pct(99.9)
              << "\n";

    return 0;
}
```

Build & run
```bash
# build
make

# run the harness
./harness
```

Notes and improvements
- Use hdr_histogram and export results to CSV or a rendered histogram.

Latency optimization is an iterative discipline: measure, hypothesize, change, measure again. Start with simple improvements (pin threads, reduce allocations, SPSC queues) and only add complexity (kernel-bypass, DPDK) when the simpler fixes are insufficient.
Written by the NordVarg Team, engineers at NordVarg specializing in high-performance financial systems and type-safe programming.