If your system needs predictable, low tail latency (P99/P99.9), focus on data layout, allocation strategy, branch predictability, and measurement. This article gives a concise, practical cookbook of C++ techniques with runnable snippets and a microbenchmark harness you can adapt.
Key takeaways: measure percentiles before optimizing, control allocation on hot paths, lay out data for the cache, keep branches predictable, and verify every change with hardware counters.
This is for senior systems engineers and C++ developers working on trading engines, networking, realtime analytics, or other latency-sensitive code. The examples are intentionally small and practical — they show patterns you can drop into an existing codebase and measure.
Before changing code, measure. Use a simple harness that reports percentiles and inspect hardware counters for cache/TLB activity.
Minimal C++ microbenchmark harness (save as bench.h and include in small examples):

```cpp
// bench.h — minimal harness
#pragma once
#include <chrono>
#include <vector>
#include <algorithm>
#include <functional>
#include <cstdio>

using ns = std::chrono::nanoseconds;
using clk = std::chrono::steady_clock; // monotonic; preferred over high_resolution_clock for timing

static std::vector<long long> run_many(std::function<void()> f, int runs = 200000) {
    std::vector<long long> out; out.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        auto t1 = clk::now();
        f();
        auto t2 = clk::now();
        out.push_back(std::chrono::duration_cast<ns>(t2 - t1).count());
    }
    std::sort(out.begin(), out.end());
    return out;
}

static void report_percentiles(const std::vector<long long>& v) {
    if (v.empty()) return;
    auto p = [&](double q){ return v[std::size_t(q * (v.size() - 1))]; };
    printf("p50=%lld ns p90=%lld ns p99=%lld ns p999=%lld ns p9999=%lld ns\n",
           p(0.50), p(0.90), p(0.99), p(0.999), p(0.9999));
}
```

Run with `perf stat -e cycles,instructions,cache-misses,L1-dcache-loads ./bench` for hardware counters. For production traces, use eBPF or `perf record -g`.
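As a usage sketch, a driver might time a placeholder workload and inspect the tail. The harness pieces are repeated here so the snippet compiles on its own; the atomic `sink` workload is just an illustrative stand-in that the optimizer cannot delete:

```cpp
// driver sketch for the bench.h harness (run_many repeated so this stands alone)
#include <chrono>
#include <vector>
#include <algorithm>
#include <functional>
#include <atomic>

using ns = std::chrono::nanoseconds;
using clk = std::chrono::steady_clock;

static std::vector<long long> run_many(std::function<void()> f, int runs = 20000) {
    std::vector<long long> out; out.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        auto t1 = clk::now();
        f();
        auto t2 = clk::now();
        out.push_back(std::chrono::duration_cast<ns>(t2 - t1).count());
    }
    std::sort(out.begin(), out.end());       // sorted samples make percentile lookup trivial
    return out;
}

static long long percentile(const std::vector<long long>& v, double q) {
    return v[std::size_t(q * (v.size() - 1))];
}
```

Call `run_many` with the code under test, then read off `percentile(samples, 0.99)` and friends; comparing p50 against p99 is usually more informative than the mean.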
malloc implementations are improving, but dynamic allocation can still add jitter (page faults, per-thread caches, OS interactions). Use a simple per-thread pool for fixed-size allocations.
Example: small fixed-size per-thread arena (conceptual)

```cpp
// simple_pool.h
#pragma once
#include <vector>
#include <cstddef>
#include <cstdlib>

struct SimplePool {
    std::vector<char*> slabs;
    size_t slab_size;
    size_t next_offset = 0;
    char* current = nullptr;

    explicit SimplePool(size_t slab_size_bytes = 1 << 20) : slab_size(slab_size_bytes) {
        refill();
    }

    void refill() {
        current = (char*)::malloc(slab_size);
        slabs.push_back(current);
        next_offset = 0;
    }

    void* alloc(size_t n) {
        // Conceptual: assumes n <= slab_size; a real pool must handle oversized requests.
        if (next_offset + n > slab_size) refill();
        void* p = current + next_offset;
        next_offset += ((n + 15) & ~size_t(15)); // round up to 16-byte alignment
        return p;
    }

    ~SimplePool() {
        for (auto p : slabs) ::free(p);
    }
};

// usage
// thread_local SimplePool pool;
// void* obj = pool.alloc(sizeof(MyObject));
```

Notes: keep one pool per thread (`thread_local`) to avoid locking and NUMA cross-talk.

False sharing happens when two threads update different variables that share the same cache line. The cache-coherency traffic kills latency.
Example (bad vs good):

```cpp
#include <cstdint>

// bad
struct Counters { uint64_t a; uint64_t b; };
Counters c;
// two threads: one updates c.a, the other updates c.b -> false sharing

// good
struct alignas(64) PaddedCounter { uint64_t v; };
struct CountersPadded { PaddedCounter a; PaddedCounter b; };
```

When investigating, `perf stat -e cache-misses,cache-references` per thread and `perf top` can show the hotspots.
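A quick way to check the padded layout is to verify the struct sizes and run two writer threads, one per counter. This sketch only verifies correctness and layout; the latency difference itself has to be measured with the harness above:

```cpp
#include <cstdint>
#include <thread>

struct alignas(64) PaddedCounter { uint64_t v = 0; };
struct CountersPadded { PaddedCounter a; PaddedCounter b; }; // a and b land on separate cache lines

// Each thread hammers its own counter. With the padded layout there is no
// shared line to ping-pong between cores; with plain adjacent uint64_t
// fields the same loop would generate heavy coherency traffic.
CountersPadded run_writers(int iters) {
    CountersPadded c;
    std::thread t1([&]{ for (int i = 0; i < iters; ++i) ++c.a.v; });
    std::thread t2([&]{ for (int i = 0; i < iters; ++i) ++c.b.v; });
    t1.join(); t2.join();
    return c;
}
```

Benchmark the same loop with the unpadded `Counters` struct and compare p99; the gap grows with core count.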
Mispredicted branches flush pipelines and cost tens of cycles. On hot loops, unpredictable branches produce long tails.
Simple comparison: branchy vs branchless sign function

```cpp
int sign_branch(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    return 0;
}

int sign_branchless(int x) {
    return (x > 0) - (x < 0);
}
```

If input x is uniformly random, the branchy version produces many mispredictions. If inputs are skewed (mostly positive), branchy is fine. Use microbenchmarks to decide.
Another branchless trick (table lookup): replace a small switch/if chain with an array lookup when possible.
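As a concrete sketch of the lookup idea (the message-type codes and priorities here are illustrative, not from the article): a small switch over a dense code range can become a single predictable load.

```cpp
#include <cstdint>

// Hypothetical mapping from a 4-value message-type code to a priority.
// The switch compiles to a compare chain or jump table; both involve
// control flow the predictor can miss on.
int priority_switch(uint8_t type) {
    switch (type) {
        case 0: return 3;
        case 1: return 1;
        case 2: return 4;
        default: return 0;
    }
}

// The table version replaces control flow with one data load.
int priority_lookup(uint8_t type) {
    static const int table[4] = {3, 1, 4, 0};
    return table[type & 3]; // mask keeps the index in range, branch-free
}
```

This pays off when the input codes are unpredictable; for heavily skewed inputs the switch's predicted branch can be just as fast.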
SIMD (SSE/AVX) speeds up data-parallel work: parsing, aggregation, vector math. It's most effective when processing contiguous arrays.
Short example (sum of floats) using intrinsics (x86, AVX2):

```cpp
#include <immintrin.h>
#include <cstddef>

float sum_avx2(const float* a, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(a + i);
        acc = _mm256_add_ps(acc, v);
    }
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    float s = tmp[0]+tmp[1]+tmp[2]+tmp[3]+tmp[4]+tmp[5]+tmp[6]+tmp[7];
    for (; i < n; ++i) s += a[i]; // scalar tail
    return s;
}
```

When to use SIMD: contiguous, data-parallel work over arrays large enough to amortize the setup, with little data-dependent branching.

When not to use SIMD: scattered or pointer-chasing access patterns, heavy per-element branching, or tiny inputs where a scalar loop already finishes in a few cycles.
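If you cannot commit to AVX2 at build time, a portable alternative is to write the reduction with independent accumulators and let the compiler vectorize it at `-O3`. This is a sketch; the unroll factor of 4 is a tunable assumption, not a measured optimum:

```cpp
#include <cstddef>

// Four independent accumulators break the loop-carried dependency,
// which lets the compiler vectorize and/or pipeline the additions.
float sum_unrolled(const float* a, size_t n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) s += a[i]; // scalar tail
    return s;
}
```

Note that both this and the AVX2 version change the summation order, so results can differ from a strict left-to-right scalar sum by floating-point rounding; check the assembly with Compiler Explorer to confirm vectorization actually happened.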
For streaming access patterns (processing messages, parsing arrays), prefetching reduces load stalls. Use __builtin_prefetch as a hint; tune the lookahead distance to your workload.
Example:

```cpp
for (size_t i = 0; i < n; ++i) {
    if (i + 16 < n) __builtin_prefetch(&arr[i + 16]); // lookahead distance: tune per workload
    process(arr[i]);
}
```

Avoid over-prefetching; it wastes bandwidth and pollutes caches.
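A self-contained version of that pattern, assuming a simple summation as the per-element work (`__builtin_prefetch` is a GCC/Clang builtin, and the 16-element distance is a starting guess, not a tuned value):

```cpp
#include <cstddef>
#include <cstdint>

// Streams over an array, hinting the line ~16 elements ahead.
// The prefetch is purely a hint: dropping it never changes the result.
uint64_t sum_with_prefetch(const uint64_t* arr, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + 16 < n) __builtin_prefetch(&arr[i + 16], /*rw=*/0, /*locality=*/1);
#endif
        s += arr[i];
    }
    return s;
}
```

For a purely linear scan like this the hardware prefetcher often wins already; the hint matters more for strided or indirectly indexed access, so always measure before keeping it.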
Use the C++ memory model: memory_order_relaxed for atomic counters when ordering isn't required; release/acquire for producer/consumer handoffs.
Example: producer-consumer flag

```cpp
#include <atomic>
#include <vector>

std::atomic<bool> ready(false);
std::vector<int> buf;

void producer() {
    buf.push_back(42);
    ready.store(true, std::memory_order_release);
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}
    // safe to read buf: the acquire load synchronizes with the release store
}
```

`seq_cst` is the strongest ordering and can be slower; avoid it on ultra-hot paths unless you actually need it.
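For the relaxed-counter case mentioned above, a minimal sketch (assumes the counter is only read for totals, so no ordering with other memory operations is required):

```cpp
#include <atomic>
#include <thread>
#include <cstdint>

std::atomic<uint64_t> events{0};

// memory_order_relaxed is enough here: atomicity guarantees no lost
// increments, and we never use the counter to order other accesses.
void record_events(int n) {
    for (int i = 0; i < n; ++i)
        events.fetch_add(1, std::memory_order_relaxed);
}
```

On x86 the relaxed and seq_cst forms of `fetch_add` compile to the same locked instruction, but on weaker architectures (ARM, POWER) relaxed avoids extra fences.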
On multi-socket machines, allocate memory on the same NUMA node as the consumer thread and pin the thread to a core. Use numactl in production or pthread_setaffinity_np in code. Example run command:
```sh
# run on NUMA node 1's CPUs (e.g. cores 8-11 on a two-socket box) and allocate memory on node 1
numactl --cpunodebind=1 --membind=1 ./my_low_latency_app
```
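In code, pinning the calling thread on Linux looks roughly like this (a sketch using `pthread_setaffinity_np`; the core number is an example you would pick to match your NUMA topology):

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core; returns true on success.
// Linux-specific: pthread_setaffinity_np is a GNU extension.
bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

Pin before the thread touches its working set, so first-touch page allocation lands on the local node.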
Scenario: a single-threaded parse + process loop sees P99 spikes due to allocations and cache churn. Refactor checklist:

- Replace `new`/`delete` with a thread-local `SimplePool`.

Expected outcome: fewer page faults, smaller working set, reduced cache misses, and tighter P99.
Small test idea: compare allocating a 64-byte struct using new vs allocation from SimplePool in a tight hot loop, reporting p50/p99.
(Adapt the bench.h harness above. Build with -O3 -march=native -flto for realistic results.)
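A sketch of that comparison. It is self-contained, so it inlines a minimal bump allocator equivalent to `SimplePool` rather than including the header, and `Payload` is a hypothetical stand-in for the 64-byte struct:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

struct Payload { char bytes[64]; }; // stand-in 64-byte object

// Minimal bump allocator standing in for SimplePool.
struct Bump {
    std::vector<char*> slabs;
    size_t cap, off = 0;
    char* cur = nullptr;
    explicit Bump(size_t cap_bytes = 1 << 20) : cap(cap_bytes) { refill(); }
    void refill() { cur = (char*)::malloc(cap); slabs.push_back(cur); off = 0; }
    void* alloc(size_t n) {
        size_t need = (n + 15) & ~size_t(15);        // 16-byte-aligned bump
        if (off + need > cap) refill();
        void* p = cur + off; off += need; return p;
    }
    ~Bump() { for (auto p : slabs) ::free(p); }
};

// The two hot-loop variants to wrap in run_many() from bench.h:
void* alloc_heap()        { return new Payload; }          // remember to delete after timing
void* alloc_pool(Bump& b) { return b.alloc(sizeof(Payload)); }
```

Time each variant with `run_many`, reporting p50/p99; the pool path is a pointer bump, so the interesting number is how much of the heap path's tail comes from allocator slow paths and page faults.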
- `-O3 -march=native -flto` for release builds (only when you can control deployment CPU families).
- `-g` + `-Og` for development performance debugging; optimize only when measured.
- `-fno-omit-frame-pointer` when profiling with perf to get accurate stacks.
- `objdump -d` or Compiler Explorer for inspecting generated assembly in hot loops.
- Tune `__builtin_prefetch` carefully: the wrong distance causes harm.
- Replace `new`/`delete` on hot paths with a preallocated pool or slab allocator.
NordVarg Team is a software engineer at NordVarg specializing in high-performance financial systems and type-safe programming.