November 11, 2025 • NordVarg Team
CPU Internals for Software Engineers: Caches, Pipelines, and the Cost of a Branch

Performance · CPU Architecture · C++ · Low-Latency
7 min read

Introduction#

Software engineers building performance-sensitive systems (trading engines, networking, real-time analytics) benefit enormously from understanding the hardware they run on. This article translates CPU microarchitecture concepts — caches, TLBs, pipelining, out-of-order execution, SMT, and branch prediction — into practical rules, code examples, and measurement techniques you can apply today.

We emphasize predictability and tail latency (P99/P99.9) rather than headline throughput numbers. Small changes at the code or allocation level can reduce tail latency by orders of magnitude if they avoid pathological hardware interactions.

Why microarchitecture matters (short)#

Modern CPUs are deep, complex machines: multiple cache levels, translation lookaside buffers (TLBs), speculative execution, and many parallel execution units. When your code touches the wrong working set, or triggers repeated mispredictions, the result is not a slightly slower program — it's sharp tail latency spikes.

Design goal: keep hot code and frequently accessed data small, cache-friendly, and lock-free where possible. Measure, then change.

Measurement primer (how to reason about results)#

Before optimizing, measure:

  • Use hardware counters (perf, Intel VTune) for event-level visibility.
  • Use high-resolution timers for microbenchmarks: rdtsc or clock_gettime(CLOCK_MONOTONIC_RAW).
  • Report distributions, not just averages: P50, P90, P99, P99.9. Tail latency matters.

Small microbenchmark harness (C++ pseudo):

cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <functional>
#include <vector>

using ns  = std::chrono::nanoseconds;
using clk = std::chrono::steady_clock;  // monotonic; safer than high_resolution_clock

std::vector<long long> run_many(const std::function<void()>& f, int runs = 100000) {
    std::vector<long long> out;
    out.reserve(runs);
    for (int i = 0; i < runs; i++) {
        auto t1 = clk::now();
        f();
        auto t2 = clk::now();
        out.push_back(std::chrono::duration_cast<ns>(t2 - t1).count());
    }
    return out;
}

// Report P50/P90/P99
void report(std::vector<long long>& a) {
    std::sort(a.begin(), a.end());
    auto p = [&](double q) { return a[size_t(q * (a.size() - 1))]; };
    printf("p50=%lld ns p90=%lld ns p99=%lld ns\n", p(0.50), p(0.90), p(0.99));
}


This is enough to get started; for production, use perf, eBPF, or platform-specific profilers.

Cache hierarchy & working set design#

CPUs have multiple cache levels (L1, L2, L3). L1 is tiny and extremely fast; L3 is larger and shared between cores. The most important rule: keep your hot working set smaller than the cache you rely on.

Practical guidelines:

  • Put hot, frequently-read data in a compact contiguous array rather than a pointer-chasing structure.
  • Avoid random heap allocations on hot paths; they increase TLB pressure and cause cache-line scattering.
  • Align frequently-updated variables to cache-line boundaries if they are updated by different cores to avoid false sharing (typically 64 bytes on x86).

Example: false sharing

cpp
#include <cstdint>

struct Counters {
    alignas(64) uint64_t a;  // own cache line
    alignas(64) uint64_t b;  // own cache line
};

// If `a` and `b` are frequently updated on different threads,
// aligning each to its own cache line prevents false sharing.

When you see periodic latency spikes, check for false sharing by looking at cache-misses per core with perf.
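To see the working-set rule in action, compare summing the same values through a contiguous array versus a shuffled pointer chase, where every load depends on the previous one and likely misses. This is our own illustrative sketch (helper names like `make_chain` are ours, not from the article); timed with a harness like the one above, the chased version typically shows far higher and noisier latency.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Sum N ints sequentially (cache-friendly; the hardware prefetcher wins).
long long sum_sequential(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v) s += x;
    return s;
}

// Sum the same ints by following a shuffled index chain
// (pointer-chasing: every step is a dependent, likely-missing load).
long long sum_chased(const std::vector<int>& v, const std::vector<size_t>& next) {
    long long s = 0;
    size_t i = 0;
    for (size_t n = 0; n < v.size(); ++n) {
        s += v[i];
        i = next[i];
    }
    return s;
}

// Build a random single-cycle permutation so the chase visits every element once.
std::vector<size_t> make_chain(size_t n, unsigned seed = 42) {
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937(seed));
    std::vector<size_t> next(n);
    for (size_t k = 0; k < n; ++k) next[order[k]] = order[(k + 1) % n];
    return next;
}
```

Both functions compute the same result; only the access pattern differs, which isolates the cost of cache misses.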

TLBs, page sizes, and huge pages#

Virtual memory requires address translation via page tables. The TLB caches these translations; TLB misses are expensive because they walk page tables. Small random accesses across many pages can thrash the TLB.

Mitigations:

  • Use contiguous memory (arrays) to reduce number of pages touched.
  • Use huge pages (2MiB or 1GiB) when you have large, mostly-read datasets (e.g., market data snapshots) to reduce TLB pressure.
  • Use tools like perf stat -e dTLB-load-misses to quantify TLB activity.

Warning: huge pages complicate allocation and operations like fork(). Use them where they measurably help.
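On Linux, explicit huge pages can be requested with `mmap(MAP_HUGETLB)`; a hedged sketch (our code, not from the article) that falls back to normal pages with a transparent-huge-page hint when explicit huge pages are not configured:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Try to back a large buffer with 2 MiB huge pages (Linux-specific).
// `bytes` should be a multiple of the huge page size for MAP_HUGETLB.
void* alloc_maybe_huge(size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) return p;

    // Fallback: regular pages, plus a hint to the kernel's THP machinery.
    p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, bytes, MADV_HUGEPAGE);
    return p;
}
```

Note the explicit fallback: `MAP_HUGETLB` fails unless the administrator has reserved huge pages (`vm.nr_hugepages`), which is exactly the operational complication the warning above refers to.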

Pipelines and out-of-order execution (what reordering means for you)#

CPUs fetch and retire instructions in program order, but execute them out of order to keep execution units busy. This hides the latency of long-latency operations (cache misses, memory loads). However, some software assumptions (especially about timing) break if you rely on in-order execution.

Important takeaways:

  • Avoid long-latency synchronous memory operations on the hot path. When unavoidable, try to overlap them (software prefetch, batching, asynchronous IO).
  • Compilers and CPUs both reorder; use language memory-model primitives (atomics and fences) where ordering matters.

Branch prediction and the cost of a misprediction#

Branch prediction keeps pipelines full. A correctly predicted branch costs almost nothing; a misprediction causes pipeline flushes and costs tens to hundreds of cycles depending on pipeline depth.

Practical rule: make hot paths as branch-predictable as possible. When branching behavior is data-dependent and unpredictable, prefer branchless code or table-driven logic.

Example: branch vs branchless

cpp
// Branchy version
int sign_branch(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    return 0;
}

// Branchless version using comparisons mapped to integers
int sign_branchless(int x) {
    return (x > 0) - (x < 0);
}

Microbenchmarks often show the branchless variant has fewer high-latency outliers when x distribution is unpredictable.

Demonstration of misprediction cost (conceptual):

cpp
// Measure the cost of an unpredictable branch vs a predictable one
// by feeding the same loop random vs skewed (e.g., sorted) inputs.

If a misprediction costs ~15–20 cycles on modern CPUs and your code loops millions of times, mispredictions can dominate tail latency.
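A runnable version of that demonstration (our sketch; exact timings vary by CPU, so the code only measures and does not assume specific numbers) is the classic sorted-vs-random experiment: the same branchy loop runs much faster once the input is sorted, because the predictor locks onto the pattern.

```cpp
#include <algorithm>
#include <chrono>
#include <random>
#include <vector>

// Count elements >= 128 with a data-dependent branch.
// On random bytes the branch is ~50/50 and mispredicts constantly;
// on sorted input the predictor locks on after one transition.
long long count_big(const std::vector<int>& v) {
    long long c = 0;
    for (int x : v) {
        if (x >= 128) ++c;   // the unpredictable branch
    }
    return c;
}

// Time one pass in nanoseconds; `result` receives the count.
double time_ns(const std::vector<int>& v, long long& result) {
    auto t1 = std::chrono::steady_clock::now();
    result = count_big(v);
    auto t2 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t2 - t1).count();
}
```

Run `time_ns` on a random vector, then on a sorted copy: the counts are identical, but the sorted pass is usually several times faster, and the gap is pure branch-misprediction cost.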

Software prefetching and streaming access patterns#

Prefetching tells the CPU to fetch a cache line before you use it. Used carefully, prefetch reduces stalls on predictable streams (e.g., sequential parsing of arrays).

C++ example:

cpp
for (size_t i = 0; i < n; ++i) {
    if (i + 8 < n) __builtin_prefetch(&arr[i + 8]);
    process(arr[i]);
}

Prefetch offsets should be tuned to your workload (latency and CPU frequency). Over-prefetching wastes bandwidth.

Atomic operations, memory ordering, and fences (practical examples)#

Atomics are essential for correctness in concurrent code, but the memory order you choose affects performance.

  • memory_order_relaxed — no ordering guarantees, cheapest.
  • memory_order_acquire / memory_order_release — ordering for synchronization.
  • memory_order_seq_cst — strongest guarantees, typically slower.

Example: lock-free flag with release/acquire semantics

cpp
std::atomic<bool> ready{false};
std::vector<int> data;

// Producer
void produce() {
    data.push_back(42);
    ready.store(true, std::memory_order_release);
}

// Consumer
void consume() {
    while (!ready.load(std::memory_order_acquire)) {}
    // Now safely read `data`
}

Avoid unnecessary seq_cst operations on hot paths. Use relaxed where you only need atomicity and not ordering.
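A common case where relaxed is enough is a statistics counter: each increment must be atomic, but no thread synchronizes on the value, so no ordering is needed. A minimal sketch (our example, not from the article):

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Event counter: increments need atomicity but no ordering, so
// memory_order_relaxed avoids paying for a full fence per event.
std::atomic<uint64_t> events{0};

void worker(int n) {
    for (int i = 0; i < n; ++i)
        events.fetch_add(1, std::memory_order_relaxed);
}

uint64_t run_workers(int threads, int per_thread) {
    std::vector<std::thread> ts;
    for (int t = 0; t < threads; ++t) ts.emplace_back(worker, per_thread);
    for (auto& th : ts) th.join();
    return events.load(std::memory_order_relaxed);
}
```

The final count is exact because `fetch_add` is atomic regardless of ordering; relaxed only drops the inter-thread ordering guarantees, which this counter never relied on.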

Hardware fences (like asm volatile("mfence")) are rarely necessary if you use the C++ memory model correctly.

NUMA and cache-coherency implications#

On multi-socket systems, memory is attached to a NUMA node. Accessing remote memory incurs higher latency. Also, cache-coherency traffic can dominate when many cores write the same cache line.

Practical rules:

  • Pin threads to cores (CPU affinity) and allocate memory on the local NUMA node (numactl, mbind, pthread_setaffinity_np).
  • For per-thread buffers, prefer thread-local allocation to reduce cross-node traffic.
  • Avoid hot shared writable locations. Use per-core queues or sharded counters and occasionally aggregate.
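The sharded-counter idea above can be sketched as follows (our illustration; shard count and assignment policy are placeholders): each shard lives on its own cache line, so writers never contend, and a reader sums the shards occasionally.

```cpp
#include <atomic>
#include <cstdint>

// Sharded counter: each thread/core increments its own padded slot,
// avoiding coherency traffic on a single hot cache line.
constexpr int kShards = 16;

struct alignas(64) Shard {        // 64-byte alignment: one cache line per shard
    std::atomic<uint64_t> value{0};
};

Shard shards[kShards];

void add(int shard_id, uint64_t n) {
    shards[shard_id].value.fetch_add(n, std::memory_order_relaxed);
}

// Aggregation is approximate while writers are active, which is
// usually acceptable for metrics; do it off the hot path.
uint64_t total() {
    uint64_t sum = 0;
    for (auto& s : shards) sum += s.value.load(std::memory_order_relaxed);
    return sum;
}
```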

Example: making a small tick processor predictable#

Start with a naive pipeline that parses messages into heap-allocated objects and updates shared counters under a global lock. Problems: allocation variability, cache thrash, lock contention.

Refactor checklist:

  1. Use a memory pool (preallocated, per-thread) for message objects.
  2. Parse into contiguous buffers (avoid pointer-chasing structs).
  3. Use per-core queues and a single writer for shared structures (or lock-free aggregation).
  4. Pin threads and isolate NICs / interrupts to specific cores.
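Step 4 of the checklist can be done programmatically on Linux; a hedged sketch using `pthread_setaffinity_np` (the result must be checked, since in containers or restricted cgroups the requested CPU may not be available):

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one CPU (Linux-specific). Returns true on success.
bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

Pair this with NUMA-local allocation (`numactl`, `mbind`) so the pinned thread's memory actually lives on its node.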

The cumulative result should shift the distribution left and tighten the tail.

Short code snippets and idioms#

Custom per-thread pool (sketch):

cpp
// Very small example; a production pool needs bounds checks and replenishment
struct Pool {
    std::vector<char*> slabs;
    size_t next = 0;
    Pool(size_t n = 1024, size_t sz = 256) {
        slabs.reserve(n);
        for (size_t i = 0; i < n; ++i) slabs.push_back(new char[sz]);
    }
    char* alloc() { return slabs[next++]; }  // hot path: bump an index, no malloc
};

thread_local Pool tl_pool;

void* hot_alloc() { return tl_pool.alloc(); }

Lock-free handoff (single-producer single-consumer ring): prefer well-tested implementations (boost::lockfree, folly's SPSC queue) rather than rolling your own.

Checklist: Practical rules of thumb#

  • Measure first. Prefer percentiles to means.
  • Keep hot data contiguous and small.
  • Align frequently-updated fields to cache-line boundaries to avoid false sharing.
  • Reduce TLB pressure by using contiguous allocations or huge pages for large datasets.
  • Make branches predictable; if not possible, prefer branchless alternatives.
  • Pin threads to cores and allocate memory near the thread (NUMA-awareness).
  • Use memory_order_relaxed where ordering is not required; prefer acquire/release for synchronization.
  • Avoid heap allocations on hot paths; use preallocated pools.
  • Use hardware-assisted counters and tracing (perf, eBPF) for deep investigations.

Further reading#

  • Intel 64 and IA-32 Architectures Optimization Reference Manual
  • Brendan Gregg — Linux Performance (perf techniques)
  • "Computer Architecture: A Quantitative Approach" (Hennessy & Patterson) — for deeper background
  • Agner Fog — optimization manuals and instruction tables