
November 11, 2025 • NordVarg Team

Latency Optimization for C++ in HFT Trading — Practical Guide

A hands-on guide to profiling and optimizing latency in C++ trading code: hardware-aware design, kernel-bypass networking, lock-free queues, memory layout, and measurement best-practices.

General · C++ · HFT · latency · performance · systems
12 min read

Latency Optimization for C++ in HFT Trading — Practical Guide#

Low latency isn't an academic trophy — it's a business lever. In HFT and market-making, microseconds (and often nanoseconds) change the probability of execution, your queue position, and ultimately profitability. This article is a pragmatic, runnable guide for C++ engineers building low-latency trading systems. It focuses on the hot paths (market-data ingestion, order construction and send), measurement-first workflows, and techniques that have real-world payoff.

  • Audience: systems & quant engineers, senior C++ devs working on trading infra
  • Goal: give you a reproducible toolset and concrete code patterns to reduce tail latency

Quick summary#

  • Measure first. Optimizing without measurement is dangerous.
  • Optimize the whole stack: hardware, OS, NIC, memory, compiler and algorithms.
  • Prefer simple, deterministic techniques. Prefer correctness and testability over cleverness.

1. Metrics: what to measure and why#

Optimizing the wrong metric wastes time. Tail latency (p99, p99.9, p99.99) is usually what matters in trading: the mean is interesting, but a low mean with fat tails still loses.

  • Mean, median — gives central tendency
  • p95, p99, p99.9, p99.99 — tail behaviour; p99/p999 often used in HFT
  • Jitter — variability over time
  • Latency distribution over workloads (peak hours, bursts)
  • Throughput — messages/sec; useful for saturation checks

Define measurement semantics clearly: where timestamps are collected (userland vs NIC vs gateway) and how you correlate them end-to-end.
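The percentile definitions above translate directly into code. A minimal nearest-rank percentile helper, sketched for offline analysis (for streaming data, prefer a histogram library such as HdrHistogram):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile: sorts a copy of the samples and indexes into it.
// Fine for post-run analysis of a latency trace; not for the hot path.
uint64_t percentile(std::vector<uint64_t> samples, double p) {
  assert(!samples.empty() && p >= 0.0 && p <= 100.0);
  std::sort(samples.begin(), samples.end());
  size_t idx = static_cast<size_t>((p / 100.0) * (samples.size() - 1));
  return samples[idx];
}
```

With samples 1..100, `percentile(s, 50.0)` returns 50 and `percentile(s, 100.0)` returns 100; report p99/p99.9/p99.99 together rather than the mean alone.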


2. Measure & profile: tools and a workflow#

Always start with realistic load and reproducible tests.

  • Tools:
    • perf / perf record / perf script
    • bpftrace and eBPF (tracepoints, USDT)
    • flamegraphs (Brendan Gregg's tools)
    • custom timestamping using TSC (rdtsc) or high-res clocks + careful calibration
    • tcpdump + tshark for network-level traces

A simple workflow:

  1. Reproduce the issue (synthetic traffic with correct message patterns).
  2. Record CPU samples with perf record -F 997 -a -g -- ./app.
  3. Generate a flamegraph and inspect hot symbols.
  4. If network-side, capture packets and timestamps at both sender and receiver and correlate.

Example flamegraph pipeline (on Linux):

bash
perf record -F 997 -a -g -- ./market_data_consumer
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flamegraph.svg

For kernel & NIC interactions, bpftrace can track syscalls and interrupts without the sampling noise of perf.


3. Hot paths: common sources of latency#

  • Market data parsing and message dispatch
  • Order serialization and send (syscall overhead)
  • NIC/driver wait times and interrupts
  • Cross-thread communication, queuing
  • Memory allocation and pointer chasing

Document your hot path carefully: identify which functions are on the critical path and how they interact with hardware.


4. Hardware & OS tuning#

Small changes at the hardware/OS layer give big wins. Typical items:

  • CPU pinning and IRQ affinity: pin threads to specific cores and move NIC interrupts to isolated cores
  • isolcpus / nohz_full: reduce kernel interference on hot cores
  • Real-time priorities (carefully): chrt and SCHED_FIFO for critical threads
  • Hugepages for large allocations to reduce TLB pressure
  • Disable power-saving features (C-states) on low-latency boxes

Example: pinning a thread in C++:

cpp
#include <cstdio>
#include <pthread.h>

// Pin the calling thread to one core so the scheduler cannot migrate it.
void pin_thread(int cpu) {
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(cpu, &cpuset);
  int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
  if (rc != 0) std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
}

Be careful: real-time priorities and CPU isolation require operational discipline. Use monitoring and fallbacks so misconfigured services can't take the machine.
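The "fallbacks" point can be made concrete: attempt SCHED_FIFO and degrade gracefully when the process lacks the privilege, rather than aborting. A sketch (the priority value 10 is an arbitrary choice):

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sched.h>

// Try to switch the calling thread to SCHED_FIFO; without CAP_SYS_NICE this
// typically fails with EPERM, in which case we keep the default scheduler
// and let monitoring flag the degraded mode. Returns true if FIFO is active.
bool try_set_fifo(int priority) {
  sched_param sp{};
  sp.sched_priority = priority;
  if (sched_setscheduler(0, SCHED_FIFO, &sp) == 0) return true;
  std::fprintf(stderr, "SCHED_FIFO unavailable (%s); using default scheduler\n",
               std::strerror(errno));
  return false;
}
```

Pair this with an alert when the fallback path is taken, so a misdeployed binary is noticed before it costs you tail latency.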


5. Bare-metal deployment and service management#

For ultra-low-latency systems prefer deploying directly on bare metal rather than adding layers (VMs, containers) that can introduce jitter. The recommendations below assume you control the host and can apply kernel, BIOS, and service-level configuration.

  • Build a minimal, static or nearly-static binary for predictable startup and minimal runtime dependencies. Prefer static linking where licensing and binary size allow (or use a minimal distro image if dynamic linking is required).
  • Package as a single artifact (tar.gz) that contains the binary, a small etc/ config, and a systemd unit to manage lifecycle. This keeps deploys reproducible without adding container layers.
  • Use systemd for service management but tune it for realtime workloads (see examples below). Running under systemd gives you supervision, logging, and controlled restart semantics while still running natively on the host.

Host ops checklist (commands to run as root or via sudo):

bash
# set performance CPU governor
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  echo performance > "$cpu/cpufreq/scaling_governor" 2>/dev/null || true
done

# disable swap for deterministic memory behavior (or set swappiness low)
swapoff -a
sysctl -w vm.swappiness=1

# configure hugepages (default 2 MiB pages; 1 GiB pages need hugepagesz= on the boot line)
sysctl -w vm.nr_hugepages=512
mount -t hugetlbfs nodev /dev/hugepages || true

# kernel bootline examples (edit /etc/default/grub and update-grub):
# GRUB_CMDLINE_LINUX="quiet splash isolcpus=2-5 nohz_full=2-5 rcu_nocbs=2-5 mitigations=off"

Systemd unit example (place in /etc/systemd/system/lowlat.service):

ini
[Unit]
Description=Low-latency trading service
After=network.target

[Service]
Type=simple
ExecStart=/opt/lowlat/bin/trading_service --config /etc/lowlat/config.yaml
Restart=on-failure
# Raise CPU scheduling priority (Nice= is not real-time; see the notes below for SCHED_FIFO)
Nice=-10
IOSchedulingClass=realtime
LimitMEMLOCK=infinity
LimitNOFILE=1048576
TasksMax=infinity
Delegate=yes

[Install]
WantedBy=multi-user.target

Enable and start with:

bash
systemctl daemon-reload
systemctl enable --now lowlat.service

Notes:

  • Delegate=yes allows systemd to hand cgroups to the process; use it if your service manages cgroups or child processes.
  • LimitMEMLOCK=infinity is necessary when you use mlock/hugepages.
  • If your process needs SCHED_FIFO, provide a wrapper script that sets the scheduling or give appropriate capabilities — be cautious and test thoroughly.


6. NUMA awareness#

For multi-socket machines, memory locality matters. Common rules:

  • Pin threads and allocate thread-owned data on the same NUMA node
  • Use numactl to control allocation, or manually bind with mbind/mmap flags
  • Use scalable memory allocators tuned for multi-threaded workloads (jemalloc, tcmalloc)

Example numactl usage for a service:

bash
numactl --cpunodebind=0 --membind=0 ./trading_service

7. Networking: kernel-bypass and NIC features#

Network latency is often the dominant factor. Techniques:

  • Kernel-bypass frameworks: DPDK, PF_RING, netmap — reduce kernel traversal but increase complexity.
  • Use NIC features judiciously: RSS, Flow Director, hardware timestamping; treat TOE offload with caution.
  • Make send path non-blocking and avoid syscalls on the hot path (batch sends when possible).

If you do use kernel-bypass, isolate the NIC on dedicated cores and keep the code path minimal and deterministic.
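The "batch sends" advice above, sketched with POSIX vectored I/O: instead of one write per message, gather pending message buffers into a single writev call and pay one syscall for the whole batch (framing and partial-write handling elided):

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// One syscall for a batch of message buffers instead of one per message.
// Callers accumulate iovecs while the socket is busy, then flush together.
ssize_t send_batch(int fd, const std::vector<iovec> &iov) {
  if (iov.empty()) return 0;
  return ::writev(fd, iov.data(), static_cast<int>(iov.size()));
}
```

On Linux, sendmmsg extends the same idea to UDP datagrams, and kernel-bypass stacks apply it at the NIC descriptor-ring level.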


8. Data structures & memory layout#

Make memory friendly for cache and CPU pipelines.

  • Prefer SoA (structure of arrays) for heavy vectorized access; AoS for small object locality.
  • Reduce pointer chasing; avoid deep indirections in hot path.
  • Use prefetching only where measured beneficial.
  • Short-lived allocations are costly: prefer object pools and slab allocators.

Example: an AoS -> SoA transformation sketch

cpp
#include <cstdint>
#include <vector>

struct OrderAoS { uint64_t id; double price; int qty; };
// vs
struct OrdersSoA {
  std::vector<uint64_t> id;
  std::vector<double> price;
  std::vector<int> qty;
};

Benchmarks determine which layout is best for your access pattern.
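The object-pool advice from the list above, as a minimal free-list sketch (fixed capacity, single-threaded; a production pool would add alignment, construction control, and per-thread ownership rules):

```cpp
#include <cstddef>
#include <vector>

// Fixed-capacity object pool: preallocates all slots up front and recycles
// them through a free list, so the hot path never calls malloc/free.
// Single-threaded by design — use one pool per owning thread.
template <typename T>
class Pool {
public:
  explicit Pool(std::size_t n) : slots_(n) {
    free_.reserve(n);
    for (auto &s : slots_) free_.push_back(&s);
  }
  T *acquire() {
    if (free_.empty()) return nullptr;  // exhausted: caller decides policy
    T *p = free_.back();
    free_.pop_back();
    return p;
  }
  void release(T *p) { free_.push_back(p); }
private:
  std::vector<T> slots_;   // stable storage: never resized after the ctor
  std::vector<T *> free_;
};
```

Because slots_ is never resized, pointers handed out by acquire() remain valid for the pool's lifetime, and acquire/release are both O(1) with no allocation.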


9. Synchronization: avoid blocking where possible#

On the hot path, locking kills tail latency. Alternatives:

  • SPSC ring buffers for single-producer single-consumer
  • MPMC lock-free queues with careful ABA handling
  • Sequence locks and RCU for read-mostly data
  • Avoid syscalls (futex) in the hot path

A minimal SPSC ring buffer (single-producer, single-consumer):

cpp
// spsc_ring.hpp — header-only minimal SPSC ring
#include <atomic>
#include <cstddef>
#include <vector>

template<typename T>
class SPSCQueue {
public:
  explicit SPSCQueue(size_t capacity) : cap_(capacity + 1), buf_(cap_) {}

  bool push(const T &v) {
    const size_t head = head_.load(std::memory_order_relaxed);
    const size_t next = (head + 1) % cap_;
    if (next == tail_.load(std::memory_order_acquire)) return false; // full
    buf_[head] = v;
    head_.store(next, std::memory_order_release);
    return true;
  }

  bool pop(T &out) {
    const size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return false; // empty
    out = buf_[tail];
    tail_.store((tail + 1) % cap_, std::memory_order_release);
    return true;
  }

private:
  const size_t cap_;
  std::vector<T> buf_;
  std::atomic<size_t> head_{0}, tail_{0};
};

This SPSC queue is low-overhead and well-suited for hand-off between a network parsing thread and a processing thread.
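One refinement worth measuring: head_ and tail_ above sit on the same cache line, so the producer's and consumer's writes ping-pong that line between cores (false sharing). Padding the indices onto separate 64-byte lines is a common SPSC refinement; a sketch of the layout (64 is the usual x86 line size — an assumption, not a guarantee):

```cpp
#include <atomic>
#include <cstddef>

// Keep the producer-written index and the consumer-written index on
// different cache lines, so a write to one does not invalidate the
// other core's cached copy of the other.
struct alignas(64) PaddedIndices {
  alignas(64) std::atomic<std::size_t> head{0};  // written by the producer
  alignas(64) std::atomic<std::size_t> tail{0};  // written by the consumer
};

static_assert(sizeof(PaddedIndices) >= 128,
              "head and tail must land on distinct 64-byte lines");
```

C++17's std::hardware_destructive_interference_size (in <new>) can replace the hard-coded 64 on compilers that provide it; either way, verify the win with the measurement workflow from section 2.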


10. Serialization & zero-copy#

Avoid copying on the hot path. Patterns:

  • Zero-copy parsing: parse in-place on a pre-allocated buffer
  • Offsets instead of copies for string-like fields
  • Lightweight binary formats for network (in-place parsing rather than stream parsers)

For protocol encoding, minimal templates and manual packing often outperform heavy libraries.
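The in-place pattern above, sketched for a hypothetical fixed-layout message: fixed fields are read at known offsets with memcpy (which compilers lower to plain loads, avoiding strict-aliasing UB), and the variable-length symbol stays as a pointer + length into the receive buffer rather than being copied into a std::string. The wire layout here is invented for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical little-endian wire layout:
//   [0..7]   order id   (uint64)
//   [8..15]  price      (double)
//   [16..19] qty        (int32)
//   [20]     symbol len (uint8), symbol bytes follow
struct OrderView {
  uint64_t id;
  double price;
  int32_t qty;
  const char *symbol;   // points into the receive buffer — no copy
  size_t symbol_len;
};

// Parses in place: no allocation, no string copies. The caller must keep
// the buffer alive for as long as the view is in use.
bool parse_order(const uint8_t *buf, size_t n, OrderView &out) {
  if (n < 21) return false;
  std::memcpy(&out.id, buf, 8);
  std::memcpy(&out.price, buf + 8, 8);
  std::memcpy(&out.qty, buf + 16, 4);
  out.symbol_len = buf[20];
  if (n < 21 + out.symbol_len) return false;
  out.symbol = reinterpret_cast<const char *>(buf + 21);
  return true;
}
```

The lifetime rule is the price of zero-copy: views must not outlive the buffer, which is easy to enforce when buffers are pooled per connection.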


11. Compiler & build optimizations#

  • Use -O3 -march=native -flto -fno-exceptions -fno-rtti where appropriate in hot modules.
  • Use Profile-Guided Optimization (PGO) for real improvement on hot paths.
  • Link-Time Optimization (LTO) can improve inlining across translation units.

PGO quick steps (gcc/clang):

bash
# 1. Build with instrumentation
CXXFLAGS='-fprofile-generate -O2' make
# 2. Run a realistic workload to generate profile data (.gcda for GCC, .profraw for Clang)
./run_workload
# 3. Rebuild using the profile
CXXFLAGS='-fprofile-use -O3' make

Measure before/after to avoid regressing tail latency.


12. Measurement & validation: microbenchmarks vs end-to-end#

Microbenchmarks isolate a technique's potential. End-to-end tests measure real-world impact. Use both.

  • Microbenchmarks: small harnesses that measure one algorithm or data-structure
  • End-to-end: full pipeline under synthetic but realistic load

Example microbenchmark harness (simplified): it spins up a producer that writes messages into the SPSC queue and a consumer that timestamps them and records latencies. Use high-resolution clocks and export histogram data (hdrhistogram is a good choice).


13. Example case study & before/after#

A small case study: parsing/unpacking market data then pushing into a consumer queue.

  • Baseline: naive parser using separate allocations, std::queue, and mallocs
  • Optimized: in-place parser, SPSC queue, preallocated buffers, pinned threads

Typical before/after improvements (example numbers, illustrative):

  • Mean latency: 420 µs -> 48 µs
  • p99: 4.2 ms -> 0.27 ms

Concrete numbers depend on workload and hardware; use the measurement recipe above.


14. Risk & trade-offs#

  • Complexity: kernel-bypass and lock-free code are harder to maintain
  • Debugging: realtime and kernel-bypass paths are harder to introspect
  • Portability: some optimizations depend on hardware and kernel

Document and assert assumptions in code (static assertions, config checks).
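Asserting assumptions in code, per the note above: compile-time checks catch layout drift at build time, and a startup check refuses to run with a configuration the latency assumptions do not cover. The specific invariants here are examples, not prescriptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical hot-path message; the parser assumes this exact size and
// packing, so fail the build if a refactor changes it.
struct WireMsg {
  uint64_t seq;
  uint64_t t_ns;
};
static_assert(sizeof(WireMsg) == 16, "wire layout changed: update the parser");
static_assert(sizeof(void *) == 8, "64-bit build assumed");

// Startup-time check: reject configurations that would fail subtly under
// load (e.g. a non-power-of-two ring capacity breaking mask arithmetic).
bool validate_config(std::size_t queue_capacity) {
  if (queue_capacity == 0 ||
      (queue_capacity & (queue_capacity - 1)) != 0) {
    std::fprintf(stderr, "queue capacity must be a nonzero power of two\n");
    return false;
  }
  return true;
}
```

Failing loudly at compile time or startup is cheap; discovering a violated assumption from a p99.99 regression in production is not.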


15. Reproducible artifacts (what I'm including)#

Below are minimal, reproducible artifacts you can use as a starting point. Copy these into tools/latency-harness/ in your repo to iterate.

Build flags snippet (Makefile)#

Makefile
CXX = g++
CXXFLAGS = -O3 -march=native -flto -pthread -std=c++20 -g
LDFLAGS = -flto -pthread

all: harness

# spsc_ring.hpp is header-only, so the only object file is harness.o
harness: harness.o
	$(CXX) $(LDFLAGS) -o harness harness.o

%.o: %.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@

Perf + flamegraph steps#

bash
# run the harness under load (in a separate terminal)
./harness
# in another terminal
perf record -F 997 -p $(pidof harness) -g -- sleep 30
perf script > out.perf
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl out.folded > flamegraph.svg

Microbenchmark harness (SPSC + harness)#

Below is a minimal, self-contained microbenchmark you can copy into tools/latency-harness/. It uses a small header-only SPSC ring (spsc_ring.hpp) and a harness (harness.cpp) that measures one-way latencies (producer timestamp to consumer observation) and prints basic percentiles.

Place spsc_ring.hpp and harness.cpp next to the Makefile above and build with the provided Makefile.

spsc_ring.hpp

cpp
// spsc_ring.hpp
#pragma once
#include <atomic>
#include <cstddef>
#include <vector>

template<typename T>
class SPSCQueue {
public:
  explicit SPSCQueue(size_t capacity)
    : cap_(capacity + 1), buf_(cap_) {}

  bool push(const T &v) {
    const size_t head = head_.load(std::memory_order_relaxed);
    const size_t next = increment(head);
    if (next == tail_.load(std::memory_order_acquire)) return false; // full
    buf_[head] = v;
    head_.store(next, std::memory_order_release);
    return true;
  }

  bool pop(T &out) {
    const size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return false; // empty
    out = buf_[tail];
    tail_.store(increment(tail), std::memory_order_release);
    return true;
  }

private:
  size_t increment(size_t i) const noexcept { return (i + 1) % cap_; }
  const size_t cap_;
  std::vector<T> buf_;
  std::atomic<size_t> head_{0};
  std::atomic<size_t> tail_{0};
};

harness.cpp

cpp
// harness.cpp
#include "spsc_ring.hpp"
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

struct Msg {
  uint64_t seq;
  uint64_t t_ns; // timestamp in ns produced by producer
};

using Clock = std::chrono::steady_clock;

int main() {
  const size_t rounds = 1'000'000; // messages
  const size_t qcap = 1 << 16;

  SPSCQueue<Msg> q(qcap);
  std::vector<uint64_t> latencies;
  latencies.reserve(rounds);

  std::thread consumer([&](){
    Msg m;
    size_t received = 0;
    while (received < rounds) {
      if (!q.pop(m)) continue; // spin
      auto now = Clock::now();
      uint64_t now_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(now.time_since_epoch()).count();
      uint64_t lat = now_ns - m.t_ns;
      latencies.push_back(lat);
      ++received;
    }
  });

  // Producer
  for (size_t i = 0; i < rounds; ++i) {
    Msg m;
    m.seq = i;
    auto now = Clock::now();
    m.t_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(now.time_since_epoch()).count();
    // spin until push succeeds
    while (!q.push(m)) {}
  }

  consumer.join();

  if (latencies.empty()) {
    std::cerr << "No samples collected\n";
    return 2;
  }

  std::sort(latencies.begin(), latencies.end());
  auto pct = [&](double p) {
    size_t idx = std::min(latencies.size() - 1, (size_t)((p / 100.0) * latencies.size()));
    return latencies[idx];
  };

  uint64_t sum = 0;
  for (auto v : latencies) sum += v;
  double mean = double(sum) / latencies.size();

  std::cout << "samples=" << latencies.size() << " mean_ns=" << (uint64_t)mean
            << " p50_ns=" << pct(50.0)
            << " p90_ns=" << pct(90.0)
            << " p99_ns=" << pct(99.0)
            << " p999_ns=" << pct(99.9)
            << "\n";

  return 0;
}

Build & run

bash
# build
make

# run the harness
./harness

Notes and improvements

  • This harness is intentionally minimal and focuses on one-way latency from producer timestamp to consumer observation. It uses spinning for clarity and the lowest-latency path; in real tests you may want to pin threads, set schedulers, enable hugepages, lock memory, and disable frequency scaling as described earlier in the article.
  • For more robust histograms, integrate hdr_histogram and export results to CSV or a rendered histogram.

Small checklist before deploying optimizations#

  • Make sure the test workload matches the production protocol.
  • Run experiments during maintenance windows.
  • Keep a rollback plan and preserve baseline artifacts and configs.

Conclusion#

Latency optimization is an iterative discipline: measure, hypothesize, change, measure again. Start with simple improvements (pin threads, reduce allocations, SPSC queues) and only add complexity (kernel-bypass, DPDK) when the simpler fixes are insufficient.
