Software engineers building performance-sensitive systems (trading engines, networking, real-time analytics) benefit enormously from understanding the hardware they run on. This article translates CPU microarchitecture concepts — caches, TLBs, pipelining, out-of-order execution, SMT, and branch prediction — into practical rules, code examples, and measurement techniques you can apply today.
We emphasize predictability and tail latency (P99/P99.9) rather than headline throughput numbers. Small changes at the code or allocation level can reduce tail latency by orders of magnitude if they avoid pathological hardware interactions.
Modern CPUs are deep, complex machines: multiple cache levels, translation lookaside buffers (TLBs), speculative execution, and many parallel execution units. When your code touches the wrong working set, or triggers repeated mispredictions, the result is not a slightly slower program — it's sharp tail latency spikes.
Design goal: keep hot code and frequently accessed data small, cache-friendly, and lock-free where possible. Measure, then change.
Before optimizing, measure. A small microbenchmark harness (C++):
```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <functional>
#include <vector>

using ns = std::chrono::nanoseconds;
using clk = std::chrono::steady_clock;

// Time `f` many times and collect per-call latencies in nanoseconds.
std::vector<long long> run_many(const std::function<void()>& f, int runs = 100000) {
    std::vector<long long> out;
    out.reserve(runs);
    for (int i = 0; i < runs; i++) {
        auto t1 = clk::now();
        f();
        auto t2 = clk::now();
        out.push_back(std::chrono::duration_cast<ns>(t2 - t1).count());
    }
    return out;
}

// Report P50/P90/P99
void report(std::vector<long long>& a) {
    std::sort(a.begin(), a.end());
    auto p = [&](double q) { return a[size_t(q * (a.size() - 1))]; };
    printf("p50=%lld ns p90=%lld ns p99=%lld ns\n", p(0.50), p(0.90), p(0.99));
}
```
This is enough to get started; for production, use perf, eBPF, or platform-specific profilers.
CPUs have multiple cache levels (L1, L2, L3). L1 is tiny and extremely fast; L3 is larger and shared between cores. The most important rule: keep your hot working set smaller than the cache you rely on.
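To see this effect directly, the following sketch walks buffers of increasing size in cache-line strides and reports the average cost per access; the buffer sizes, pass count, and 64-byte line size are illustrative assumptions, not measured properties of any particular CPU.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Walk a buffer in cache-line strides and return average ns per access.
// As `bytes` grows past the L1 and L2 capacities of the machine, the
// per-access cost typically rises because each touch misses a closer cache.
double ns_per_access(size_t bytes) {
    std::vector<uint8_t> buf(bytes, 1);
    volatile uint64_t sink = 0;         // defeat dead-code elimination
    const size_t stride = 64;           // typical cache-line size (assumed)
    const int passes = 100;
    auto t1 = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p)
        for (size_t i = 0; i < bytes; i += stride)
            sink += buf[i];
    auto t2 = std::chrono::steady_clock::now();
    size_t accesses = size_t(passes) * (bytes / stride);
    return std::chrono::duration<double, std::nano>(t2 - t1).count() / accesses;
}
```

Calling this for, say, 16 KiB, 256 KiB, and 64 MiB buffers usually shows distinct plateaus corresponding to the cache levels of the host machine.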
Practical guidelines: keep the hot working set compact, and avoid false sharing, where two threads repeatedly update different variables that happen to share a cache line.

Example: false sharing
```cpp
#include <cstdint>

struct Counters {
    alignas(64) uint64_t a;   // each counter gets its own cache line
    alignas(64) uint64_t b;
};

// If `a` and `b` are frequently updated on different threads,
// aligning them to separate cache lines prevents false sharing.
```

When you see periodic latency spikes, check for false sharing by looking at cache misses per core with perf.
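A small two-thread demo makes the pattern concrete. This is a sketch: the struct and iteration count are illustrative, and the interesting comparison is the wall-clock difference when the `alignas(64)` qualifiers are removed.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Two threads hammer independent counters. With alignas(64) each counter
// lives on its own cache line; without it, the counters would likely share
// a line and every increment would bounce that line between cores.
struct PaddedCounters {
    alignas(64) std::atomic<uint64_t> a{0};
    alignas(64) std::atomic<uint64_t> b{0};
};

void hammer(PaddedCounters& c, int iters) {
    std::thread ta([&] {
        for (int i = 0; i < iters; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread tb([&] {
        for (int i = 0; i < iters; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    ta.join();
    tb.join();
}
```

Timing `hammer` with and without the alignment (e.g. with the harness above) is a quick way to quantify false-sharing cost on your own hardware.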
Virtual memory requires address translation via page tables. The TLB caches these translations; TLB misses are expensive because they walk page tables. Small random accesses across many pages can thrash the TLB.
Mitigations: keep hot data densely packed on few pages, and consider huge pages, which cover more memory per TLB entry. Use perf stat -e dTLB-load-misses to quantify TLB activity.

Warning: huge pages complicate allocation and operations like fork(). Use them where they measurably help.
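The thrashing pattern is easy to reproduce: touching one byte per page consumes a fresh TLB entry per access, while sequential bytes reuse each translation thousands of times. A sketch, assuming a 4 KiB page size (on POSIX systems, query the real value with sysconf(_SC_PAGESIZE)):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum one byte per page: every access needs a distinct address translation,
// so a large buffer walked this way stresses the TLB far more than a
// sequential scan over the same bytes.
uint64_t touch_one_byte_per_page(const std::vector<uint8_t>& buf,
                                 size_t page = 4096) {
    uint64_t sum = 0;
    for (size_t i = 0; i < buf.size(); i += page)
        sum += buf[i];
    return sum;
}
```

Comparing dTLB-load-misses for this loop against a plain sequential sum over the same buffer shows the difference directly.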
CPUs issue instructions in order but execute many out-of-order to keep execution units busy. This hides latency of long-latency operations (cache misses, memory loads). However, some software assumptions (especially about timing) break if you rely on in-order execution.
Important takeaways: never rely on instruction ordering for inter-thread correctness; use the language's memory model and explicit synchronization, and treat any timing assumption about execution order as invalid.
Branch prediction keeps pipelines full. A correctly predicted branch costs almost nothing; a misprediction causes pipeline flushes and costs tens to hundreds of cycles depending on pipeline depth.
Practical rule: make hot paths as branch-predictable as possible. When branching behavior is data-dependent and unpredictable, prefer branchless code or table-driven logic.
Example: branch vs branchless
```cpp
// Branchy version
int sign_branch(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    return 0;
}

// Branchless version using comparisons mapped to integers
int sign_branchless(int x) {
    return (x > 0) - (x < 0);
}
```

Microbenchmarks often show the branchless variant has fewer high-latency outliers when the distribution of x is unpredictable.
Demonstration of misprediction cost (conceptual):
```cpp
// Measure the cost of an unpredictable branch vs a predictable branch
// by feeding random vs skewed (or sorted) inputs to the same loop.
```

If a misprediction costs ~15–20 cycles on modern CPUs and your code loops millions of times, mispredictions can dominate tail latency.
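That experiment can be fleshed out as follows (the element count, value range, and threshold are arbitrary choices for illustration): count elements above a threshold over random data, then over the same data sorted, and compare timings with the harness above. The loop is identical; only branch predictability changes.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Branchy loop whose branch outcome depends on the data distribution.
int count_above(const std::vector<int>& v, int threshold) {
    int n = 0;
    for (int x : v)
        if (x > threshold) ++n;  // ~random on shuffled input, predictable on sorted
    return n;
}

std::vector<int> random_data(size_t n, int lo, int hi) {
    std::mt19937 rng(42);  // fixed seed for repeatable runs
    std::uniform_int_distribution<int> d(lo, hi);
    std::vector<int> v(n);
    for (int& x : v) x = d(rng);
    return v;
}

// Sorting the same data makes the branch nearly perfectly predicted:
//   auto v = random_data(1 << 20, 0, 255);
//   std::sort(v.begin(), v.end());
//   count_above(v, 128);  // same result, far fewer mispredictions
```

On typical hardware the sorted run is markedly faster despite executing the same instructions on the same values.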
Prefetching tells the CPU to fetch a cache line before you use it. Used carefully, prefetch reduces stalls on predictable streams (e.g., sequential parsing of arrays).
C++ example:
```cpp
for (size_t i = 0; i < n; ++i) {
    if (i + 8 < n) __builtin_prefetch(&arr[i + 8]);  // fetch 8 elements ahead
    process(arr[i]);
}
```

Prefetch offsets should be tuned to your workload (memory latency and CPU frequency). Over-prefetching wastes bandwidth.
Atomics are essential for correctness in concurrent code; but the memory order you choose affects performance.
- memory_order_relaxed — no ordering guarantees, cheapest.
- memory_order_acquire / memory_order_release — ordering for synchronization.
- memory_order_seq_cst — strongest guarantees, typically slower.

Example: lock-free flag with release/acquire semantics
```cpp
#include <atomic>
#include <vector>

std::atomic<bool> ready{false};
std::vector<int> data;

// Producer
void produce() {
    data.push_back(42);
    ready.store(true, std::memory_order_release);  // publish `data`
}

// Consumer
void consume() {
    while (!ready.load(std::memory_order_acquire)) {}  // spin until published
    // Now safely read `data`
}
```

Avoid unnecessary seq_cst operations on hot paths. Use relaxed where you only need atomicity and not ordering.
Hardware fences (like asm volatile("mfence")) are rarely necessary if you use the C++ memory model correctly.
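A common case where relaxed is enough: a statistics counter that is only read for reporting needs atomicity but no ordering with other data. A minimal sketch; the counter and function names are illustrative.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

std::atomic<uint64_t> events_seen{0};

// Hot path: atomicity only; this counter does not publish any other data,
// so no acquire/release pairing is needed.
void record_event() {
    events_seen.fetch_add(1, std::memory_order_relaxed);
}

// Reporting path: a relaxed load gives an approximate-but-valid snapshot.
uint64_t snapshot() {
    return events_seen.load(std::memory_order_relaxed);
}
```

If the counter ever gates access to other memory (like the `ready` flag above), it stops being a pure counter and needs release/acquire again.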
On multi-socket systems, memory is attached to a NUMA node. Accessing remote memory incurs higher latency. Also, cache-coherency traffic can dominate when many cores write the same cache line.
Practical rules: pin threads to cores, allocate memory on the NUMA node that will use it, and avoid having many cores write the same cache line (numactl, mbind, pthread_setaffinity_np).

Start with a naive pipeline that parses messages into heap-allocated objects and updates shared counters under a global lock. Problems: allocation variability, cache thrash, lock contention.
Refactor checklist: replace per-message heap allocation with per-thread pools, move shared counters out of contended cache lines, and swap the global lock for lock-free handoff between stages.

The cumulative result should shift the latency distribution left and tighten the tail.
Custom per-thread pool (sketch):
```cpp
#include <cassert>
#include <vector>

// Very small example; production pools need safety and replenishment.
struct Pool {
    std::vector<char*> slabs;
    size_t next = 0;
    char* alloc() {
        assert(next < slabs.size());  // a real pool would replenish here
        return slabs[next++];
    }
};

thread_local Pool tl_pool;

void* hot_alloc() { return tl_pool.alloc(); }
```

Lock-free handoff (single-producer single-consumer ring): prefer well-tested implementations (boost::lockfree, folly's SPSC queue) rather than rolling your own.
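For orientation, the shape of such a ring looks roughly like this. A teaching sketch only: fixed capacity, no blocking, no cache-line padding of the indices; the production libraries above handle all of that.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Minimal SPSC ring: exactly one thread calls push(), exactly one calls pop().
// Each index is written by only one side; the release store in push() pairs
// with the acquire load in pop() to publish each slot's contents.
template <typename T, size_t N>
struct SpscRing {
    std::array<T, N> buf{};
    std::atomic<size_t> head{0};  // next slot to write (producer-owned)
    std::atomic<size_t> tail{0};  // next slot to read (consumer-owned)

    bool push(const T& v) {
        size_t h = head.load(std::memory_order_relaxed);
        size_t next = (h + 1) % N;
        if (next == tail.load(std::memory_order_acquire)) return false;  // full
        buf[h] = v;
        head.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    std::optional<T> pop() {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return std::nullopt;  // empty
        T v = buf[t];
        tail.store((t + 1) % N, std::memory_order_release);  // free the slot
        return v;
    }
};
```

Note the ring holds at most N-1 elements: one slot is sacrificed so that full and empty states are distinguishable without an extra counter.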
Use memory_order_relaxed where ordering is not required; prefer acquire/release for synchronization.