
Practical C++ for Sub‑Microsecond Latency: Micro‑Optimizations That Actually Matter

November 11, 2025 • NordVarg Team • 7 min read

Performance · C++ · Low-Latency · Microbenchmarking

Introduction

If your system needs predictable, low tail latency (P99/P99.9), focus on data layout, allocation strategy, branch predictability, and measurement. This article gives a concise, practical cookbook of C++ techniques with runnable snippets and a microbenchmark harness you can adapt.

Key takeaways:

  • Measure percentiles, not means. P99+ is the target.
  • Avoid heap allocations on hot paths; use preallocated per-thread pools.
  • Keep hot data contiguous and small to fit L1/L2 caches when possible.
  • Avoid false sharing by aligning frequently-updated fields to cache lines.
  • Prefer branchless code or predictable branches on hot paths.
  • Use SIMD and software prefetching for predictable streaming workloads.
  • Tune compiler flags (LTO, -O3, -march) and pin threads to cores.

Audience and scope

This is for senior systems engineers and C++ developers working on trading engines, networking, realtime analytics, or other latency-sensitive code. The examples are intentionally small and practical — they show patterns you can drop into an existing codebase and measure.

Measurement primer (the single most important step)

Before changing code, measure. Use a simple harness that reports percentiles and inspect hardware counters for cache/TLB activity.

Minimal C++ microbenchmark harness (save as bench.h and include in small examples):

```cpp
// bench.h — minimal harness
#pragma once
#include <chrono>
#include <vector>
#include <algorithm>
#include <functional>
#include <cstdio>

using ns = std::chrono::nanoseconds;
// steady_clock is monotonic; high_resolution_clock may alias a non-monotonic clock
using clk = std::chrono::steady_clock;

static std::vector<long long> run_many(std::function<void()> f, int runs = 200000) {
    std::vector<long long> out; out.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        auto t1 = clk::now();
        f();
        auto t2 = clk::now();
        out.push_back(std::chrono::duration_cast<ns>(t2 - t1).count());
    }
    std::sort(out.begin(), out.end());
    return out;
}

static void report_percentiles(const std::vector<long long>& v) {
    auto p = [&](double q){ return v[std::size_t(q * (v.size()-1))]; };
    printf("p50=%lld ns p90=%lld ns p99=%lld ns p999=%lld ns p9999=%lld ns\n",
           p(0.50), p(0.90), p(0.99), p(0.999), p(0.9999));
}
```

Run with perf stat -e cycles,instructions,cache-misses,L1-dcache-loads ./bench for hardware counters. For production traces, use eBPF or perf record -g.

Custom allocators & memory pools (avoid malloc/free on the hot path)

malloc implementations are improving, but dynamic allocation can still add jitter (page faults, per-thread caches, OS interactions). Use a simple per-thread pool for fixed-size allocations.

Example: small fixed-size per-thread arena (conceptual)

```cpp
// simple_pool.h
#pragma once
#include <vector>
#include <cstddef>
#include <cstdlib>

struct SimplePool {
    std::vector<char*> slabs;
    size_t slab_size;
    size_t next_offset = 0;
    char* current = nullptr;

    explicit SimplePool(size_t slab_size_bytes = 1<<20) : slab_size(slab_size_bytes) {
        refill();
    }

    void refill() {
        current = (char*)::malloc(slab_size);
        slabs.push_back(current);
        next_offset = 0;
    }

    // Bump-pointer allocation; assumes n <= slab_size. Individual frees are not supported.
    void* alloc(size_t n) {
        size_t need = (n + 15) & ~size_t(15); // round up to a 16-byte multiple
        if (next_offset + need > slab_size) refill();
        void* p = current + next_offset;
        next_offset += need;
        return p;
    }

    ~SimplePool() {
        for (auto p : slabs) ::free(p);
    }
};

// usage
// thread_local SimplePool pool;
// void* obj = pool.alloc(sizeof(MyObject));
```

Notes:

  • Use thread-local pools (thread_local) to avoid locking and NUMA cross-talk.
  • For production, prefer an existing allocator (tcmalloc, jemalloc, or Hoard) tuned for your workload, or a battle-tested lock-free pool library.

Avoiding false sharing

False sharing happens when two threads update different variables that share the same cache line. The cache coherency traffic kills latency.

Example (bad vs good):

```cpp
#include <cstdint>

// bad
struct Counters { uint64_t a; uint64_t b; };
Counters c;
// two threads: one updates c.a, the other updates c.b -> false sharing

// good: each counter gets its own 64-byte cache line
struct alignas(64) PaddedCounter { uint64_t v; };
struct CountersPadded { PaddedCounter a; PaddedCounter b; };
```

When investigating, run perf stat -e cache-misses,cache-references per thread; perf c2c can pinpoint contended cache lines, and perf top shows the hotspots.

Branch prediction — make hot paths predictable or branchless

Mispredicted branches flush pipelines and cost tens of cycles. On hot loops, unpredictable branches produce long tails.

Simple comparison: branchy vs branchless sign function

```cpp
int sign_branch(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    return 0;
}

int sign_branchless(int x) {
    return (x > 0) - (x < 0);
}
```

If input x is uniformly random, the branchy version produces many mispredictions. If inputs are skewed (mostly positive), branchy is fine. Use microbenchmarks to decide.

Another branchless trick (table lookup): replace a small switch/if chain with an array lookup when possible.
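For example, a small byte-classification helper (the names and categories here are illustrative, not from a specific codebase): the if-chain version costs up to two data-dependent branches per byte, while the table version is a single predictable load.

```cpp
#include <cstdint>

// if-chain version: classify an ASCII byte as digit (1), letter (2), or other (0)
int classify_if(unsigned char c) {
    if (c >= '0' && c <= '9') return 1;
    if ((c | 32) >= 'a' && (c | 32) <= 'z') return 2; // fold case, then test
    return 0;
}

// table version: precompute all 256 answers once; the hot path is one load
struct ClassTable {
    uint8_t t[256];
    ClassTable() {
        for (int c = 0; c < 256; ++c) t[c] = 0;
        for (int c = '0'; c <= '9'; ++c) t[c] = 1;
        for (int c = 'a'; c <= 'z'; ++c) t[c] = t[c - 32] = 2; // lower and upper
    }
};
static const ClassTable kClass;

int classify_table(unsigned char c) { return kClass.t[c]; }
```

The table costs 256 bytes of L1 — cheap for a hot parser, but worth counting against your working-set budget.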

SIMD basics and when to use it

SIMD (SSE/AVX) speeds up data-parallel work: parsing, aggregation, vector math. It's most effective when processing contiguous arrays.

Short example (sum of floats) using intrinsics (x86, AVX2):

```cpp
#include <immintrin.h>
#include <cstddef>

float sum_avx2(const float* a, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(a + i);
        acc = _mm256_add_ps(acc, v);
    }
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    float s = tmp[0]+tmp[1]+tmp[2]+tmp[3]+tmp[4]+tmp[5]+tmp[6]+tmp[7];
    for (; i < n; ++i) s += a[i]; // scalar tail
    return s;
}
```

When to use SIMD:

  • Hot loops over contiguous numeric data.
  • When memory bandwidth and alignment are reasonable.

When not to use SIMD:

  • Pointer-chasing code or highly-branchy logic.
  • When code complexity is not justified by measurable gains.

Software prefetching and streaming patterns

For streaming access patterns (processing messages, parsing arrays), prefetching reduces load stalls. Use __builtin_prefetch as a hint; tune the lookahead distance to your workload.

Example:

```cpp
for (size_t i = 0; i < n; ++i) {
    if (i + 16 < n) __builtin_prefetch(&arr[i + 16]);
    process(arr[i]);
}
```

Avoid over-prefetching; it wastes bandwidth and pollutes caches.

Memory ordering, atomics, and fences — correctness with performance

Use the C++ memory model: memory_order_relaxed for atomic counters when ordering isn't required; release/acquire for producer/consumer handoffs.

Example: producer-consumer flag

```cpp
#include <atomic>
#include <vector>

std::atomic<bool> ready(false);
std::vector<int> buf;

void producer() {
    buf.push_back(42);
    ready.store(true, std::memory_order_release);
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // spin until published
    // safe to read buf: the acquire load pairs with the producer's release store
}
```

memory_order_seq_cst is the strongest ordering and often the slowest; avoid it on ultra-hot paths unless you genuinely need a single total order across atomics.

NUMA & affinity — pin threads and allocate local memory

On multi-socket machines, allocate memory on the same NUMA node as the consumer thread and pin the thread to a core. Use numactl in production or pthread_setaffinity_np in code. Example run command:

```bash
# bind CPU placement and memory allocation to NUMA node 1
numactl --cpunodebind=1 --membind=1 ./my_low_latency_app
```
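In code, pinning a thread might look like the sketch below (Linux-specific; `pin_to_core` is an illustrative helper name, and the core number is a placeholder — pick isolated cores in production).

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // pthread_setaffinity_np is a GNU extension
#endif
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core; returns 0 on success.
int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Combine this with isolcpus/IRQ steering so the pinned core is not also servicing interrupts.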

Putting techniques together — a small tick processor example

Scenario: a single-threaded parse + process loop sees P99 spikes due to allocations and cache churn. Refactor checklist:

  1. Replace per-message new/delete with a thread-local SimplePool.
  2. Parse messages into an on-stack or pool-allocated POD struct (no pointers).
  3. Use a contiguous ring buffer (SPSC) to hand off between NIC thread and processing thread.
  4. Pin both threads to dedicated cores and isolate interrupts to the NIC core.

Expected outcome: fewer page faults, smaller working set, reduced cache misses, and tighter P99.
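The SPSC ring buffer from step 3 can be sketched as follows — a minimal, illustrative single-producer/single-consumer queue, not a drop-in replacement for a production library:

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer sketch; N must be a power of two.
template <typename T, size_t N>
struct SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf[N];
    // head and tail live on separate cache lines to avoid false sharing
    alignas(64) std::atomic<size_t> head{0}; // advanced by the consumer
    alignas(64) std::atomic<size_t> tail{0}; // advanced by the producer

    bool try_push(const T& v) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false; // full
        buf[t & (N - 1)] = v;
        tail.store(t + 1, std::memory_order_release); // publish the slot
        return true;
    }

    bool try_pop(T& out) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false; // empty
        out = buf[h & (N - 1)];
        head.store(h + 1, std::memory_order_release); // free the slot
        return true;
    }
};
```

The release/acquire pairing is the same handoff pattern shown in the atomics section; indices grow monotonically and are masked on access, so wraparound falls out of unsigned arithmetic.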

Microbenchmark: measure allocator vs pool

Small test idea: compare allocating a 64-byte struct using new vs allocation from SimplePool in a tight hot loop, reporting p50/p99.

(Adapt the bench.h harness above. Build with -O3 -march=native -flto for realistic results.)
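One possible shape for that test — `BumpPool` is a miniature stand-in for the SimplePool above so the snippet stands alone, and `avg_ns` reports means only; use the percentile harness for the tail numbers that actually matter:

```cpp
#include <chrono>
#include <cstdlib>
#include <cstddef>

struct Msg { char payload[64]; };

// Tiny bump allocator standing in for SimplePool (wraps instead of growing,
// which is fine for a throwaway benchmark, not for real use).
struct BumpPool {
    char* base;
    size_t off = 0, cap;
    explicit BumpPool(size_t n) : base((char*)std::malloc(n)), cap(n) {}
    ~BumpPool() { std::free(base); }
    void* alloc(size_t n) {
        size_t need = (n + 15) & ~size_t(15); // 16-byte multiples
        if (off + need > cap) off = 0;        // wrap
        void* p = base + off;
        off += need;
        return p;
    }
};

// Mean ns per call of f over `iters` iterations.
template <typename F>
long long avg_ns(F f, int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() / iters;
}

// usage sketch:
// long long heap  = avg_ns([]{ delete new Msg; }, 100000);
// BumpPool pool(1 << 20);
// long long pooled = avg_ns([&]{ (void)pool.alloc(sizeof(Msg)); }, 100000);
```

Remember to consume the allocation results (or write through them) so the optimizer cannot delete the loop body.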

Build flags and toolchain tips

  • Use -O3 -march=native -flto for release builds (only when you can control deployment CPU families).
  • Use -g + -Og for development performance debugging; optimize only when measured.
  • Prefer -fno-omit-frame-pointer when profiling with perf to get accurate stacks.
  • Use objdump -d or Compiler Explorer for inspecting generated assembly in hot loops.

Common pitfalls and anti-patterns

  • Premature optimization: profile before big changes.
  • Overuse of __builtin_prefetch: wrong distance causes harm.
  • Replacing clear code with complicated intrinsics early — prefer readable code until measured.
  • Ignoring tail metrics: an optimization that improves mean latency but worsens P99 is counterproductive for many systems.

Checklist: quick-read actions you can do today

  • Add a percentile reporting harness to your CI for critical paths.
  • Replace new/delete on hot paths with a preallocated pool or slab allocator.
  • Scan for adjacent writes from different threads and add padding or restructure.
  • Pin hot threads to cores and isolate interrupts.
  • Replace unpredictable branches with branchless equivalents where microbenchmarks show benefit.
  • Add per-release benchmark runs (perf/eBPF) and store artifacts.

Further reading

  • Intel 64 and IA-32 Architectures Optimization Reference Manual
  • Agner Fog — optimization manuals
  • Brendan Gregg — Linux Performance
  • "Computer Architecture: A Quantitative Approach" (Hennessy & Patterson)