Software engineers building performance-sensitive systems (trading engines, networking, real-time analytics) benefit enormously from understanding the hardware they run on. This article translates CPU microarchitecture concepts — caches, TLBs, pipelining, out-of-order execution, SMT, and branch prediction — into practical rules, code examples, and measurement techniques you can apply today.
We emphasize predictability and tail latency (P99/P99.9) rather than headline throughput numbers. Small changes at the code or allocation level can reduce tail latency by orders of magnitude if they avoid pathological hardware interactions.
Modern CPUs are deep, complex machines: multiple cache levels, translation lookaside buffers (TLBs), speculative execution, and many parallel execution units. When your code touches the wrong working set, or triggers repeated mispredictions, the result is not a slightly slower program — it's sharp tail latency spikes.
Design goal: keep hot code and frequently accessed data small, cache-friendly, and lock-free where possible. Measure, then change.
Before optimizing, measure. A small microbenchmark harness (C++):
```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <functional>
#include <vector>

using ns = std::chrono::nanoseconds;
using clk = std::chrono::steady_clock;

// Time `f` many times and collect per-call latencies in nanoseconds.
std::vector<long long> run_many(const std::function<void()>& f, int runs = 100000) {
    std::vector<long long> out;
    out.reserve(runs);
    for (int i = 0; i < runs; i++) {
        auto t1 = clk::now();
        f();
        auto t2 = clk::now();
        out.push_back(std::chrono::duration_cast<ns>(t2 - t1).count());
    }
    return out;
}

// Report P50/P90/P99
void report(std::vector<long long>& a) {
    std::sort(a.begin(), a.end());
    auto p = [&](double q) { return a[size_t(q * (a.size() - 1))]; };
    printf("p50=%lld ns p90=%lld ns p99=%lld ns\n", p(0.50), p(0.90), p(0.99));
}
```
This is enough to get started; for production, use perf, eBPF, or platform-specific profilers.
CPUs have multiple cache levels (L1, L2, L3). L1 is tiny and extremely fast; L3 is larger and shared between cores. The most important rule: keep your hot working set smaller than the cache you rely on.
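To see this effect directly, the following sketch walks buffers of increasing size in cache-line strides and reports the average cost per access; the buffer sizes, pass count, and 64-byte line size are illustrative assumptions, not measured properties of any particular CPU.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Walk a buffer in cache-line strides and return average ns per access.
// As `bytes` grows past the L1 and L2 capacities of the machine, the
// per-access cost typically rises because each touch misses a closer cache.
double ns_per_access(size_t bytes) {
    std::vector<uint8_t> buf(bytes, 1);
    volatile uint64_t sink = 0;         // defeat dead-code elimination
    const size_t stride = 64;           // typical cache-line size (assumed)
    const int passes = 100;
    auto t1 = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p)
        for (size_t i = 0; i < bytes; i += stride)
            sink += buf[i];
    auto t2 = std::chrono::steady_clock::now();
    size_t accesses = size_t(passes) * (bytes / stride);
    return std::chrono::duration<double, std::nano>(t2 - t1).count() / accesses;
}
```

Calling this for, say, 16 KiB, 256 KiB, and 64 MiB buffers usually shows distinct plateaus corresponding to the cache levels of the host machine.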
Practical guidelines: keep the hot working set compact, and avoid false sharing, where two threads repeatedly update different variables that happen to share a cache line.

Example: false sharing
```cpp
#include <cstdint>

struct Counters {
    alignas(64) uint64_t a;   // each counter gets its own cache line
    alignas(64) uint64_t b;
};

// If `a` and `b` are frequently updated on different threads,
// aligning them to separate cache lines prevents false sharing.
```

When you see periodic latency spikes, check for false sharing by looking at cache misses per core with perf.
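A small two-thread demo makes the pattern concrete. This is a sketch: the struct and iteration count are illustrative, and the interesting comparison is the wall-clock difference when the `alignas(64)` qualifiers are removed.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Two threads hammer independent counters. With alignas(64) each counter
// lives on its own cache line; without it, the counters would likely share
// a line and every increment would bounce that line between cores.
struct PaddedCounters {
    alignas(64) std::atomic<uint64_t> a{0};
    alignas(64) std::atomic<uint64_t> b{0};
};

void hammer(PaddedCounters& c, int iters) {
    std::thread ta([&] {
        for (int i = 0; i < iters; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread tb([&] {
        for (int i = 0; i < iters; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    ta.join();
    tb.join();
}
```

Timing `hammer` with and without the alignment (e.g. with the harness above) is a quick way to quantify false-sharing cost on your own hardware.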
Virtual memory requires address translation via page tables. The TLB caches these translations; TLB misses are expensive because they walk page tables. Small random accesses across many pages can thrash the TLB.
Mitigations: keep hot data densely packed on few pages, and consider huge pages, which cover more memory per TLB entry. Use perf stat -e dTLB-load-misses to quantify TLB activity.

Warning: huge pages complicate allocation and operations like fork(). Use them where they measurably help.
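The thrashing pattern is easy to reproduce: touching one byte per page consumes a fresh TLB entry per access, while sequential bytes reuse each translation thousands of times. A sketch, assuming a 4 KiB page size (on POSIX systems, query the real value with sysconf(_SC_PAGESIZE)):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum one byte per page: every access needs a distinct address translation,
// so a large buffer walked this way stresses the TLB far more than a
// sequential scan over the same bytes.
uint64_t touch_one_byte_per_page(const std::vector<uint8_t>& buf,
                                 size_t page = 4096) {
    uint64_t sum = 0;
    for (size_t i = 0; i < buf.size(); i += page)
        sum += buf[i];
    return sum;
}
```

Comparing dTLB-load-misses for this loop against a plain sequential sum over the same buffer shows the difference directly.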
CPUs issue instructions in order but execute many out-of-order to keep execution units busy. This hides latency of long-latency operations (cache misses, memory loads). However, some software assumptions (especially about timing) break if you rely on in-order execution.
Important takeaways: never rely on instruction ordering for inter-thread correctness; use the language's memory model and explicit synchronization, and treat any timing assumption about execution order as invalid.
Branch prediction keeps pipelines full. A correctly predicted branch costs almost nothing; a misprediction causes pipeline flushes and costs tens to hundreds of cycles depending on pipeline depth.
Practical rule: make hot paths as branch-predictable as possible. When branching behavior is data-dependent and unpredictable, prefer branchless code or table-driven logic.
Example: branch vs branchless
```cpp
// Branchy version
int sign_branch(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    return 0;
}

// Branchless version using comparisons mapped to integers
int sign_branchless(int x) {
    return (x > 0) - (x < 0);
}
```

Microbenchmarks often show the branchless variant has fewer high-latency outliers when the distribution of x is unpredictable.
Demonstration of misprediction cost (conceptual):
```cpp
// Measure the cost of an unpredictable branch vs a predictable branch
// by feeding random vs skewed (or sorted) inputs to the same loop.
```

If a misprediction costs ~15–20 cycles on modern CPUs and your code loops millions of times, mispredictions can dominate tail latency.
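That experiment can be fleshed out as follows (the element count, value range, and threshold are arbitrary choices for illustration): count elements above a threshold over random data, then over the same data sorted, and compare timings with the harness above. The loop is identical; only branch predictability changes.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Branchy loop whose branch outcome depends on the data distribution.
int count_above(const std::vector<int>& v, int threshold) {
    int n = 0;
    for (int x : v)
        if (x > threshold) ++n;  // ~random on shuffled input, predictable on sorted
    return n;
}

std::vector<int> random_data(size_t n, int lo, int hi) {
    std::mt19937 rng(42);  // fixed seed for repeatable runs
    std::uniform_int_distribution<int> d(lo, hi);
    std::vector<int> v(n);
    for (int& x : v) x = d(rng);
    return v;
}

// Sorting the same data makes the branch nearly perfectly predicted:
//   auto v = random_data(1 << 20, 0, 255);
//   std::sort(v.begin(), v.end());
//   count_above(v, 128);  // same result, far fewer mispredictions
```

On typical hardware the sorted run is markedly faster despite executing the same instructions on the same values.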
Prefetching tells the CPU to fetch a cache line before you use it. Used carefully, prefetch reduces stalls on predictable streams (e.g., sequential parsing of arrays).
C++ example:
```cpp
for (size_t i = 0; i < n; ++i) {
    if (i + 8 < n) __builtin_prefetch(&arr[i + 8]);  // fetch 8 elements ahead
    process(arr[i]);
}
```

Prefetch offsets should be tuned to your workload (memory latency and CPU frequency). Over-prefetching wastes bandwidth.
Atomics are essential for correctness in concurrent code; but the memory order you choose affects performance.
- memory_order_relaxed — no ordering guarantees, cheapest.
- memory_order_acquire / memory_order_release — ordering for synchronization.
- memory_order_seq_cst — strongest guarantees, typically slower.

Example: lock-free flag with release/acquire semantics
```cpp
#include <atomic>
#include <vector>

std::atomic<bool> ready{false};
std::vector<int> data;

// Producer
void produce() {
    data.push_back(42);
    ready.store(true, std::memory_order_release);  // publish `data`
}

// Consumer
void consume() {
    while (!ready.load(std::memory_order_acquire)) {}  // spin until published
    // Now safely read `data`
}
```

Avoid unnecessary seq_cst operations on hot paths. Use relaxed where you only need atomicity and not ordering.
Hardware fences (like asm volatile("mfence")) are rarely necessary if you use the C++ memory model correctly.
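A common case where relaxed is enough: a statistics counter that is only read for reporting needs atomicity but no ordering with other data. A minimal sketch; the counter and function names are illustrative.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

std::atomic<uint64_t> events_seen{0};

// Hot path: atomicity only; this counter does not publish any other data,
// so no acquire/release pairing is needed.
void record_event() {
    events_seen.fetch_add(1, std::memory_order_relaxed);
}

// Reporting path: a relaxed load gives an approximate-but-valid snapshot.
uint64_t snapshot() {
    return events_seen.load(std::memory_order_relaxed);
}
```

If the counter ever gates access to other memory (like the `ready` flag above), it stops being a pure counter and needs release/acquire again.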
On multi-socket systems, memory is attached to a NUMA node. Accessing remote memory incurs higher latency. Also, cache-coherency traffic can dominate when many cores write the same cache line.
Practical rules: pin threads to cores, allocate memory on the NUMA node that will use it, and avoid having many cores write the same cache line (numactl, mbind, pthread_setaffinity_np).

Start with a naive pipeline that parses messages into heap-allocated objects and updates shared counters under a global lock. Problems: allocation variability, cache thrash, lock contention.
Refactor checklist: replace per-message heap allocation with per-thread pools, move shared counters out of contended cache lines, and swap the global lock for lock-free handoff between stages.

The cumulative result should shift the latency distribution left and tighten the tail.
Custom per-thread pool (sketch):
```cpp
#include <cassert>
#include <vector>

// Very small example; production pools need safety and replenishment.
struct Pool {
    std::vector<char*> slabs;
    size_t next = 0;
    char* alloc() {
        assert(next < slabs.size());  // a real pool would replenish here
        return slabs[next++];
    }
};

thread_local Pool tl_pool;

void* hot_alloc() { return tl_pool.alloc(); }
```

Lock-free handoff (single-producer single-consumer ring): prefer well-tested implementations (boost::lockfree, folly's SPSC queue) rather than rolling your own.
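For orientation, the shape of such a ring looks roughly like this. A teaching sketch only: fixed capacity, no blocking, no cache-line padding of the indices; the production libraries above handle all of that.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Minimal SPSC ring: exactly one thread calls push(), exactly one calls pop().
// Each index is written by only one side; the release store in push() pairs
// with the acquire load in pop() to publish each slot's contents.
template <typename T, size_t N>
struct SpscRing {
    std::array<T, N> buf{};
    std::atomic<size_t> head{0};  // next slot to write (producer-owned)
    std::atomic<size_t> tail{0};  // next slot to read (consumer-owned)

    bool push(const T& v) {
        size_t h = head.load(std::memory_order_relaxed);
        size_t next = (h + 1) % N;
        if (next == tail.load(std::memory_order_acquire)) return false;  // full
        buf[h] = v;
        head.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    std::optional<T> pop() {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return std::nullopt;  // empty
        T v = buf[t];
        tail.store((t + 1) % N, std::memory_order_release);  // free the slot
        return v;
    }
};
```

Note the ring holds at most N-1 elements: one slot is sacrificed so that full and empty states are distinguishable without an extra counter.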
Use memory_order_relaxed where ordering is not required; prefer acquire/release for synchronization.