
Practical C++ for Sub‑Microsecond Latency: Micro‑Optimizations That Actually Matter

November 11, 2025 • NordVarg Team • 7 min read

Performance · C++ · Low-Latency · Microbenchmarking

Introduction

If your system needs predictable, low tail latency (P99/P99.9), focus on data layout, allocation strategy, branch predictability, and measurement. This article gives a concise, practical cookbook of C++ techniques with runnable snippets and a microbenchmark harness you can adapt.

Key takeaways:

  • Measure percentiles, not means. P99+ is the target.
  • Avoid heap allocations on hot paths; use preallocated per-thread pools.
  • Keep hot data contiguous and small to fit L1/L2 caches when possible.
  • Avoid false sharing by aligning frequently-updated fields to cache lines.
  • Prefer branchless code or predictable branches on hot paths.
  • Use SIMD and software prefetching for predictable streaming workloads.
  • Tune compiler flags (LTO, -O3, -march) and pin threads to cores.

Audience and scope

This is for senior systems engineers and C++ developers working on trading engines, networking, realtime analytics, or other latency-sensitive code. The examples are intentionally small and practical — they show patterns you can drop into an existing codebase and measure.

Measurement primer (the single most important step)

Before changing code, measure. Use a simple harness that reports percentiles and inspect hardware counters for cache/TLB activity.

Minimal C++ microbenchmark harness (save as bench.h and include in small examples):

```cpp
// bench.h — minimal harness
#pragma once
#include <chrono>
#include <vector>
#include <algorithm>
#include <functional>
#include <cstdio>

using ns = std::chrono::nanoseconds;
// steady_clock is monotonic; high_resolution_clock may alias a non-monotonic clock
using clk = std::chrono::steady_clock;

static std::vector<long long> run_many(std::function<void()> f, int runs = 200000) {
    std::vector<long long> out; out.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        auto t1 = clk::now();
        f();
        auto t2 = clk::now();
        out.push_back(std::chrono::duration_cast<ns>(t2 - t1).count());
    }
    std::sort(out.begin(), out.end());
    return out;
}

static void report_percentiles(const std::vector<long long>& v) {
    auto p = [&](double q){ return v[std::size_t(q * (v.size()-1))]; };
    printf("p50=%lld ns p90=%lld ns p99=%lld ns p999=%lld ns p9999=%lld ns\n",
           p(0.50), p(0.90), p(0.99), p(0.999), p(0.9999));
}
```

Run with perf stat -e cycles,instructions,cache-misses,L1-dcache-loads ./bench for hardware counters. For production traces, use eBPF or perf record -g.

Custom allocators & memory pools (avoid malloc/free on the hot path)

malloc implementations are improving, but dynamic allocation can still add jitter (page faults, per-thread caches, OS interactions). Use a simple per-thread pool for fixed-size allocations.

Example: small fixed-size per-thread arena (conceptual)

```cpp
// simple_pool.h
#pragma once
#include <vector>
#include <cstddef>
#include <cstdlib>

struct SimplePool {
    std::vector<char*> slabs;
    size_t slab_size;
    size_t next_offset = 0;
    char* current = nullptr;

    explicit SimplePool(size_t slab_size_bytes = 1<<20) : slab_size(slab_size_bytes) {
        refill();
    }

    void refill() {
        current = (char*)::malloc(slab_size);
        slabs.push_back(current);
        next_offset = 0;
    }

    // Bump-pointer allocation; assumes n <= slab_size. Individual frees are not supported.
    void* alloc(size_t n) {
        size_t need = (n + 15) & ~size_t(15); // round up to a 16-byte multiple
        if (next_offset + need > slab_size) refill();
        void* p = current + next_offset;
        next_offset += need;
        return p;
    }

    ~SimplePool() {
        for (auto p : slabs) ::free(p);
    }
};

// usage
// thread_local SimplePool pool;
// void* obj = pool.alloc(sizeof(MyObject));
```

Notes:

  • Use thread-local pools (thread_local) to avoid locking and NUMA cross-talk.
  • For production, prefer an existing allocator (tcmalloc, jemalloc, or Hoard) tuned for your workload, or a battle-tested lock-free pool library.

Avoiding false sharing

False sharing happens when two threads update different variables that share the same cache line. The cache coherency traffic kills latency.

Example (bad vs good):

```cpp
#include <cstdint>

// bad
struct Counters { uint64_t a; uint64_t b; };
Counters c;
// two threads: one updates c.a, the other updates c.b -> false sharing

// good: each counter gets its own 64-byte cache line
struct alignas(64) PaddedCounter { uint64_t v; };
struct CountersPadded { PaddedCounter a; PaddedCounter b; };
```

When investigating, run perf stat -e cache-misses,cache-references per thread; perf c2c can pinpoint contended cache lines, and perf top shows the hotspots.

Branch prediction — make hot paths predictable or branchless

Mispredicted branches flush pipelines and cost tens of cycles. On hot loops, unpredictable branches produce long tails.

Simple comparison: branchy vs branchless sign function

```cpp
int sign_branch(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    return 0;
}

int sign_branchless(int x) {
    return (x > 0) - (x < 0);
}
```

If input x is uniformly random, the branchy version produces many mispredictions. If inputs are skewed (mostly positive), branchy is fine. Use microbenchmarks to decide.

Another branchless trick (table lookup): replace a small switch/if chain with an array lookup when possible.
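For example, a small byte-classification helper (the names and categories here are illustrative, not from a specific codebase): the if-chain version costs up to two data-dependent branches per byte, while the table version is a single predictable load.

```cpp
#include <cstdint>

// if-chain version: classify an ASCII byte as digit (1), letter (2), or other (0)
int classify_if(unsigned char c) {
    if (c >= '0' && c <= '9') return 1;
    if ((c | 32) >= 'a' && (c | 32) <= 'z') return 2; // fold case, then test
    return 0;
}

// table version: precompute all 256 answers once; the hot path is one load
struct ClassTable {
    uint8_t t[256];
    ClassTable() {
        for (int c = 0; c < 256; ++c) t[c] = 0;
        for (int c = '0'; c <= '9'; ++c) t[c] = 1;
        for (int c = 'a'; c <= 'z'; ++c) t[c] = t[c - 32] = 2; // lower and upper
    }
};
static const ClassTable kClass;

int classify_table(unsigned char c) { return kClass.t[c]; }
```

The table costs 256 bytes of L1 — cheap for a hot parser, but worth counting against your working-set budget.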

SIMD basics and when to use it

SIMD (SSE/AVX) speeds up data-parallel work: parsing, aggregation, vector math. It's most effective when processing contiguous arrays.

Short example (sum of floats) using intrinsics (x86, AVX2):

```cpp
#include <immintrin.h>
#include <cstddef>

float sum_avx2(const float* a, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(a + i);
        acc = _mm256_add_ps(acc, v);
    }
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    float s = tmp[0]+tmp[1]+tmp[2]+tmp[3]+tmp[4]+tmp[5]+tmp[6]+tmp[7];
    for (; i < n; ++i) s += a[i]; // scalar tail
    return s;
}
```

When to use SIMD:

  • Hot loops over contiguous numeric data.
  • When memory bandwidth and alignment are reasonable.

When not to use SIMD:

  • Pointer-chasing code or highly-branchy logic.
  • When code complexity is not justified by measurable gains.

Software prefetching and streaming patterns

For streaming access patterns (processing messages, parsing arrays), prefetching reduces load stalls. Use __builtin_prefetch as a hint; tune the lookahead distance to your workload.

Example:

```cpp
for (size_t i = 0; i < n; ++i) {
    if (i + 16 < n) __builtin_prefetch(&arr[i + 16]);
    process(arr[i]);
}
```

Avoid over-prefetching; it wastes bandwidth and pollutes caches.

Memory ordering, atomics, and fences — correctness with performance

Use the C++ memory model: memory_order_relaxed for atomic counters when ordering isn't required; release/acquire for producer/consumer handoffs.

Example: producer-consumer flag

```cpp
#include <atomic>
#include <vector>

std::atomic<bool> ready(false);
std::vector<int> buf;

void producer() {
    buf.push_back(42);
    ready.store(true, std::memory_order_release);
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // spin until published
    // safe to read buf: the acquire load pairs with the producer's release store
}
```

memory_order_seq_cst is the strongest ordering and often the slowest; avoid it on ultra-hot paths unless you genuinely need a single total order across atomics.

NUMA & affinity — pin threads and allocate local memory

On multi-socket machines, allocate memory on the same NUMA node as the consumer thread and pin the thread to a core. Use numactl in production or pthread_setaffinity_np in code. Example run command:

```bash
# bind CPU placement and memory allocation to NUMA node 1
numactl --cpunodebind=1 --membind=1 ./my_low_latency_app
```
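In code, pinning a thread might look like the sketch below (Linux-specific; `pin_to_core` is an illustrative helper name, and the core number is a placeholder — pick isolated cores in production).

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // pthread_setaffinity_np is a GNU extension
#endif
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core; returns 0 on success.
int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Combine this with isolcpus/IRQ steering so the pinned core is not also servicing interrupts.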

Putting techniques together — a small tick processor example

Scenario: a single-threaded parse + process loop sees P99 spikes due to allocations and cache churn. Refactor checklist:

  1. Replace per-message new/delete with a thread-local SimplePool.
  2. Parse messages into an on-stack or pool-allocated POD struct (no pointers).
  3. Use a contiguous ring buffer (SPSC) to hand off between NIC thread and processing thread.
  4. Pin both threads to dedicated cores and isolate interrupts to the NIC core.

Expected outcome: fewer page faults, smaller working set, reduced cache misses, and tighter P99.
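The SPSC ring buffer from step 3 can be sketched as follows — a minimal, illustrative single-producer/single-consumer queue, not a drop-in replacement for a production library:

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer sketch; N must be a power of two.
template <typename T, size_t N>
struct SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf[N];
    // head and tail live on separate cache lines to avoid false sharing
    alignas(64) std::atomic<size_t> head{0}; // advanced by the consumer
    alignas(64) std::atomic<size_t> tail{0}; // advanced by the producer

    bool try_push(const T& v) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false; // full
        buf[t & (N - 1)] = v;
        tail.store(t + 1, std::memory_order_release); // publish the slot
        return true;
    }

    bool try_pop(T& out) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false; // empty
        out = buf[h & (N - 1)];
        head.store(h + 1, std::memory_order_release); // free the slot
        return true;
    }
};
```

The release/acquire pairing is the same handoff pattern shown in the atomics section; indices grow monotonically and are masked on access, so wraparound falls out of unsigned arithmetic.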

Microbenchmark: measure allocator vs pool

Small test idea: compare allocating a 64-byte struct using new vs allocation from SimplePool in a tight hot loop, reporting p50/p99.

(Adapt the bench.h harness above. Build with -O3 -march=native -flto for realistic results.)
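One possible shape for that test — `BumpPool` is a miniature stand-in for the SimplePool above so the snippet stands alone, and `avg_ns` reports means only; use the percentile harness for the tail numbers that actually matter:

```cpp
#include <chrono>
#include <cstdlib>
#include <cstddef>

struct Msg { char payload[64]; };

// Tiny bump allocator standing in for SimplePool (wraps instead of growing,
// which is fine for a throwaway benchmark, not for real use).
struct BumpPool {
    char* base;
    size_t off = 0, cap;
    explicit BumpPool(size_t n) : base((char*)std::malloc(n)), cap(n) {}
    ~BumpPool() { std::free(base); }
    void* alloc(size_t n) {
        size_t need = (n + 15) & ~size_t(15); // 16-byte multiples
        if (off + need > cap) off = 0;        // wrap
        void* p = base + off;
        off += need;
        return p;
    }
};

// Mean ns per call of f over `iters` iterations.
template <typename F>
long long avg_ns(F f, int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() / iters;
}

// usage sketch:
// long long heap  = avg_ns([]{ delete new Msg; }, 100000);
// BumpPool pool(1 << 20);
// long long pooled = avg_ns([&]{ (void)pool.alloc(sizeof(Msg)); }, 100000);
```

Remember to consume the allocation results (or write through them) so the optimizer cannot delete the loop body.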

Build flags and toolchain tips

  • Use -O3 -march=native -flto for release builds (only when you can control deployment CPU families).
  • Use -g + -Og for development performance debugging; optimize only when measured.
  • Prefer -fno-omit-frame-pointer when profiling with perf to get accurate stacks.
  • Use objdump -d or Compiler Explorer for inspecting generated assembly in hot loops.

Common pitfalls and anti-patterns

  • Premature optimization: profile before big changes.
  • Overuse of __builtin_prefetch: wrong distance causes harm.
  • Replacing clear code with complicated intrinsics early — prefer readable code until measured.
  • Ignoring tail metrics: an optimization that improves mean latency but worsens P99 is counterproductive for many systems.

Checklist: quick-read actions you can do today

  • Add a percentile reporting harness to your CI for critical paths.
  • Replace new/delete on hot paths with a preallocated pool or slab allocator.
  • Scan for adjacent writes from different threads and add padding or restructure.
  • Pin hot threads to cores and isolate interrupts.
  • Replace unpredictable branches with branchless equivalents where microbenchmarks show benefit.
  • Add per-release benchmark runs (perf/eBPF) and store artifacts.

Further reading

  • Intel 64 and IA-32 Architectures Optimization Reference Manual
  • Agner Fog — optimization manuals
  • Brendan Gregg — Linux Performance
  • "Computer Architecture: A Quantitative Approach" (Hennessy & Patterson)