
Optimizing a simulation program in C++

In my last blog post, I simulated a floating body drifting away from shore using a simple random walk model:

  1. At each step, the body moves one step to the left or to the right with equal probability
  2. The body cannot move to the shore, so if its co-ordinate is already 0, it cannot move left

Hand-optimizing this simple model was a fun exercise. I ran the simulations on my 8-core i7 laptop. Initially, I had something that barely made progress after 24 hours. After some trial and error, I was able to run \( 2^{24} \) walks with \( 2^{24} \) steps each in 12 hours.
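In code, the unoptimized model looks roughly like this (a minimal sketch with illustrative names, not the actual source; the real program also spreads the walks across multiple threads):

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

int main() {
    constexpr std::size_t N_WALKS = std::size_t{1} << 24;
    constexpr std::size_t N_STEPS = std::size_t{1} << 24;

    std::vector<std::int32_t> walk(N_WALKS, 0);
    std::mt19937_64 rng{std::random_device{}()};

    for (std::size_t s = 0; s < N_STEPS; ++s) {
        for (std::size_t i = 0; i < N_WALKS; ++i) {
            // Move left or right with equal probability...
            const std::int32_t step = (rng() & 1) ? 1 : -1;
            // ...but never past the shore at co-ordinate 0.
            walk[i] = std::max(walk[i] + step, 0);
        }
    }
}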

The source code is available on GitHub. Here are some of the optimizations I employed. I hope they give you hints on making simulations written in C++ run faster!

Premature optimization is the root of all evil. You should benchmark any optimization trick against the “unoptimized” code, because more often than not the compiler has already generated a heavily optimized binary from idiomatic source code. A basic understanding of CPU instruction sets and assembly language, and the ability to use a disassembler to analyze the supposedly “optimized” binaries, are important skills if you’re after micro-optimizations.

If you use GCC and want to try it out yourself, remember to compile with full optimization and compile for native architecture:

$ g++ -std=c++20 -O3 -march=native -o drifting drifting.cc

If you’re not using x64, you might not end up with similar results.

Align memory to CPU cache lines

A simulation program that runs multiple threads in parallel has to be extra careful about memory alignment. Data moves between RAM and the CPU caches in fixed-size blocks called cache lines, whose size depends on the CPU architecture. Even if two processor cores never access the same memory address, merely writing to memory in the same cache line forces the other core to reload that line. This phenomenon is called false sharing.
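As a minimal illustration (hypothetical, unrelated to the simulation): in the snippet below each thread only ever writes its own counter, yet the two counters almost certainly sit on the same cache line, so the line keeps bouncing between the cores and both threads slow down.

#include <cstdint>
#include <thread>

std::int64_t counters[2]; // adjacent counters, almost certainly in one cache line

void count(int idx) {
    for (std::int64_t i = 0; i < 100000000; ++i) {
        ++counters[idx]; // no data race, but the cache line ping-pongs anyway
    }
}

int main() {
    std::thread a(count, 0);
    std::thread b(count, 1);
    a.join();
    b.join();
}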

x64 uses 64-byte cache lines (other architectures may use a different size). Hence two threads should align their memory buffers to 64-byte boundaries. In C++17 and later, the easiest way to achieve this is the std::aligned_alloc() function:

const auto walks = static_cast<std::int32_t*>(std::aligned_alloc(64, N_WALKS * sizeof(std::int32_t)));

Aligning to cache lines helps only if the buffers are not shared. If some memory needs to be modified by one thread and read by another, cache invalidation is inevitable.
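When each thread’s private state lives in a struct rather than in a slice of a shared buffer, the same separation can be achieved by aligning the struct type itself. A sketch (the ThreadState struct is hypothetical, not from the actual source):

#include <cstdint>

// alignas(64) makes every instance start at a cache line boundary, and the
// resulting padding guarantees two instances never share a line.
struct alignas(64) ThreadState {
    std::int64_t steps_done;
    std::int32_t position;
};

ThreadState states[8]; // one entry per worker thread, each on its own cache line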

Somewhat related to cache lines are memory pages. Pages, however, are a virtual memory concept managed by the operating system, and since all threads in a process share the same address space, page alignment does nothing to prevent the overhead of concurrent memory access. Indeed, aligning to memory pages didn’t have any observable effect on the program’s performance.

Leverage vectorization

Many CPU instruction sets have vector (SIMD) extensions that allow the same operation (usually arithmetic) to be applied to multiple values in parallel. On x64 these extensions are called SSE and AVX; the 256-bit integer instructions this program ends up using come from the AVX2 extension.

Luckily, modern compilers help convert regular C++ loops into vectorized loops. Loops targeted for auto-vectorization should have known bounds (ideally a multiple of the vector width), and the operations performed in them should have vectorized counterparts. So always be mindful of what your CPU supports, and don’t forget to read the manual.

Let’s take the constrained random walk from this post as an example. Conditional branches inside the loop body prevent auto-vectorization, which means the following implementation of the rules does not get vectorized:

for (int i = 0; i < N_WALKS; ++i) {
    walk[i] += step;
    if (walk[i] < 0) {
        walk[i] = 0;
    }
}

In this simple model, it is easy to express the constraint of always having a non-negative co-ordinate by using the std::max() function, which is vectorizable:

for (int i = 0; i < N_WALKS; ++i) {
    walk[i] = std::max(walk[i] + step, 0);
}

Please note that the actual loop in the simulation program was a bit more complex. You can refer to the source code for details.

Disassembling the compiled binary reveals that the vectorized max instructions are indeed emitted:

$ objdump -d drifting | grep vpmaxsd
    4ab0: c4 e2 7d 3d c1 vpmaxsd %ymm1,%ymm0,%ymm0
    4b2c: c4 e2 7d 3d c1 vpmaxsd %ymm1,%ymm0,%ymm0
    4bbc: c4 e2 7d 3d c1 vpmaxsd %ymm1,%ymm0,%ymm0
    4c45: c4 e2 7d 3d c1 vpmaxsd %ymm1,%ymm0,%ymm0
    4cd1: c4 e2 7d 3d c1 vpmaxsd %ymm1,%ymm0,%ymm0
    4d59: c4 e2 7d 3d c1 vpmaxsd %ymm1,%ymm0,%ymm0
    4ddc: c4 e2 7d 3d c1 vpmaxsd %ymm1,%ymm0,%ymm0
    4e62: c4 e2 7d 3d c1 vpmaxsd %ymm1,%ymm0,%ymm0

Vectorized instructions are naturally very architecture-dependent. That is the main reason for using the -march=native option when compiling.
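If you use GCC, you don’t necessarily have to read the disassembly to find out what got vectorized: the -fopt-info-vec family of flags makes the compiler report the vectorized (or missed) loops during compilation:

$ g++ -std=c++20 -O3 -march=native -fopt-info-vec-optimized -o drifting drifting.cc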

Be mindful of the random number generator

A simulation program typically needs random numbers. I’m no expert on the merits of particular algorithms (I just went with the Mersenne Twister shipped with the C++ standard library), but whatever the algorithm, a considerable number of CPU cycles are spent inside the RNG. A single invocation of the 64-bit Mersenne Twister generates 64 bits of random data, while advancing a single walk only requires one bit. So instead of calling the RNG once per walk, I shift the bits of each random value around, turning it into 64 individual steps that the actual simulation step then consumes in a (vectorized) loop:

std::array<std::int32_t, BITS_IN_RANDOM_VALUE> steps {
    static_cast<std::int32_t>((value & 0x0000000000000001ULL) << 1 ) - 1,
    static_cast<std::int32_t>((value & 0x0000000000000002ULL)      ) - 1,
    static_cast<std::int32_t>((value & 0x0000000000000004ULL) >> 1 ) - 1,
    static_cast<std::int32_t>((value & 0x0000000000000008ULL) >> 2 ) - 1,
    static_cast<std::int32_t>((value & 0x0000000000000010ULL) >> 3 ) - 1,
    static_cast<std::int32_t>((value & 0x0000000000000020ULL) >> 4 ) - 1,
    static_cast<std::int32_t>((value & 0x0000000000000040ULL) >> 5 ) - 1,
    static_cast<std::int32_t>((value & 0x0000000000000080ULL) >> 6 ) - 1,
    static_cast<std::int32_t>((value & 0x0000000000000100ULL) >> 7 ) - 1,
    static_cast<std::int32_t>((value & 0x0000000000000200ULL) >> 8 ) - 1,
    static_cast<std::int32_t>((value & 0x0000000000000400ULL) >> 9 ) - 1,
    static_cast<std::int32_t>((value & 0x0000000000000800ULL) >> 10) - 1,
    static_cast<std::int32_t>((value & 0x0000000000001000ULL) >> 11) - 1,
    static_cast<std::int32_t>((value & 0x0000000000002000ULL) >> 12) - 1,
    static_cast<std::int32_t>((value & 0x0000000000004000ULL) >> 13) - 1,
    static_cast<std::int32_t>((value & 0x0000000000008000ULL) >> 14) - 1,
    static_cast<std::int32_t>((value & 0x0000000000010000ULL) >> 15) - 1,
    static_cast<std::int32_t>((value & 0x0000000000020000ULL) >> 16) - 1,
    static_cast<std::int32_t>((value & 0x0000000000040000ULL) >> 17) - 1,
    static_cast<std::int32_t>((value & 0x0000000000080000ULL) >> 18) - 1,
    static_cast<std::int32_t>((value & 0x0000000000100000ULL) >> 19) - 1,
    static_cast<std::int32_t>((value & 0x0000000000200000ULL) >> 20) - 1,
    static_cast<std::int32_t>((value & 0x0000000000400000ULL) >> 21) - 1,
    static_cast<std::int32_t>((value & 0x0000000000800000ULL) >> 22) - 1,
    static_cast<std::int32_t>((value & 0x0000000001000000ULL) >> 23) - 1,
    static_cast<std::int32_t>((value & 0x0000000002000000ULL) >> 24) - 1,
    static_cast<std::int32_t>((value & 0x0000000004000000ULL) >> 25) - 1,
    static_cast<std::int32_t>((value & 0x0000000008000000ULL) >> 26) - 1,
    static_cast<std::int32_t>((value & 0x0000000010000000ULL) >> 27) - 1,
    static_cast<std::int32_t>((value & 0x0000000020000000ULL) >> 28) - 1,
    static_cast<std::int32_t>((value & 0x0000000040000000ULL) >> 29) - 1,
    static_cast<std::int32_t>((value & 0x0000000080000000ULL) >> 30) - 1,
    static_cast<std::int32_t>((value & 0x0000000100000000ULL) >> 31) - 1,
    static_cast<std::int32_t>((value & 0x0000000200000000ULL) >> 32) - 1,
    static_cast<std::int32_t>((value & 0x0000000400000000ULL) >> 33) - 1,
    static_cast<std::int32_t>((value & 0x0000000800000000ULL) >> 34) - 1,
    static_cast<std::int32_t>((value & 0x0000001000000000ULL) >> 35) - 1,
    static_cast<std::int32_t>((value & 0x0000002000000000ULL) >> 36) - 1,
    static_cast<std::int32_t>((value & 0x0000004000000000ULL) >> 37) - 1,
    static_cast<std::int32_t>((value & 0x0000008000000000ULL) >> 38) - 1,
    static_cast<std::int32_t>((value & 0x0000010000000000ULL) >> 39) - 1,
    static_cast<std::int32_t>((value & 0x0000020000000000ULL) >> 40) - 1,
    static_cast<std::int32_t>((value & 0x0000040000000000ULL) >> 41) - 1,
    static_cast<std::int32_t>((value & 0x0000080000000000ULL) >> 42) - 1,
    static_cast<std::int32_t>((value & 0x0000100000000000ULL) >> 43) - 1,
    static_cast<std::int32_t>((value & 0x0000200000000000ULL) >> 44) - 1,
    static_cast<std::int32_t>((value & 0x0000400000000000ULL) >> 45) - 1,
    static_cast<std::int32_t>((value & 0x0000800000000000ULL) >> 46) - 1,
    static_cast<std::int32_t>((value & 0x0001000000000000ULL) >> 47) - 1,
    static_cast<std::int32_t>((value & 0x0002000000000000ULL) >> 48) - 1,
    static_cast<std::int32_t>((value & 0x0004000000000000ULL) >> 49) - 1,
    static_cast<std::int32_t>((value & 0x0008000000000000ULL) >> 50) - 1,
    static_cast<std::int32_t>((value & 0x0010000000000000ULL) >> 51) - 1,
    static_cast<std::int32_t>((value & 0x0020000000000000ULL) >> 52) - 1,
    static_cast<std::int32_t>((value & 0x0040000000000000ULL) >> 53) - 1,
    static_cast<std::int32_t>((value & 0x0080000000000000ULL) >> 54) - 1,
    static_cast<std::int32_t>((value & 0x0100000000000000ULL) >> 55) - 1,
    static_cast<std::int32_t>((value & 0x0200000000000000ULL) >> 56) - 1,
    static_cast<std::int32_t>((value & 0x0400000000000000ULL) >> 57) - 1,
    static_cast<std::int32_t>((value & 0x0800000000000000ULL) >> 58) - 1,
    static_cast<std::int32_t>((value & 0x1000000000000000ULL) >> 59) - 1,
    static_cast<std::int32_t>((value & 0x2000000000000000ULL) >> 60) - 1,
    static_cast<std::int32_t>((value & 0x4000000000000000ULL) >> 61) - 1,
    static_cast<std::int32_t>((value & 0x8000000000000000ULL) >> 62) - 1,
};

The above code turns a single random 64-bit integer into an array of 64 random steps (-1 or +1).

GCC couldn’t unroll the loop for me, so I unrolled it by hand. It is ugly but fast. Unrolling loops by hand is error-prone, as it often involves copy-pasting code with increasing or decreasing counters. This time, some Emacs regexp-replace magic after copy-pasting the line 64 times was enough. Writing a script to generate the unrolled code would probably have been even safer.
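For reference, the rolled-up loop that GCC wouldn’t unroll looked roughly like this (a sketch, not the exact code from the repository):

std::array<std::int32_t, BITS_IN_RANDOM_VALUE> steps;
for (unsigned bit = 0; bit < BITS_IN_RANDOM_VALUE; ++bit) {
    // Extract bit number `bit` of the random value and map 0 -> -1, 1 -> +1.
    steps[bit] = static_cast<std::int32_t>((value >> bit) & 1ULL) * 2 - 1;
}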

I naturally only did the unrolling after checking the assembly code and benchmarking that the unrolled loop performed better than the original one.

Remember, premature optimization is the root of all evil!