SIMD City: Auto-vectorisation
Written by me, proof-read by an LLM.
Details at end.
It's time to look at one of the most sophisticated optimisations compilers can do: autovectorisation. Most "big data" style problems boil down to "do this maths to huge arrays", and the limiting factor isn't the maths itself, but the feeding of instructions to the CPU, along with the data it needs.
To help with this problem, CPU designers came up with SIMD: "Single Instruction, Multiple Data". One instruction tells the CPU what to do with a whole chunk of data. These chunks could be 2, 4, 8, 16 or similar numbers of integers or floating point values, all treated individually. Initially the only way to use this capability was to write assembly language directly, but luckily for us, compilers are now able to help.¹
To take advantage of SIMD we need to ensure our data is laid out in a nice long line, like an array, or vector.² It's also much better to store different "fields" of the data in separate arrays.³ We're going to start with some integer maths - let's update an array x to be element-wise the max of x and y:
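(The exact listing isn't reproduced here; the sketch below is consistent with the assembly that follows - the updateMax name and the std::span signature are illustrative, and Size is hardcoded as footnote 4 explains.)

```cpp
#include <cstddef>
#include <span>

// Hardcoded for brevity; a "real" version would make this a template parameter.
constexpr std::size_t Size = 65536;

// Make x element-wise the max of x and y.
void updateMax(std::span<int, Size> x, std::span<const int, Size> y) {
    for (std::size_t i = 0; i < Size; ++i) {
        if (y[i] > x[i]) {
            x[i] = y[i];
        }
    }
}
```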
In this -O2-compiled code,⁴ we don't see any vectorisation, just the nice tight loop code we've come to expect:
```asm
.L3:
        mov     edx, DWORD PTR [rsi+rax*4]   ; read y[i]
        cmp     edx, DWORD PTR [rdi+rax*4]   ; compare with x[i]
        jle     .L2                          ; if less or equal, skip next instr
        mov     DWORD PTR [rdi+rax*4], edx   ; x[i] = y[i]
.L2:
        add     rax, 1                       ; ++i
        cmp     rax, 65536                   ; are we at 65536?
        jne     .L3                          ; keep looping if not
```
To ramp up to the next level, we'll need to add two flags: first we need to turn up the optimiser to -O3,⁵ and then we need to tell the compiler to target a CPU that has the appropriate SIMD instructions with something like -march=skylake.⁶
Our inner loop has gotten a little more complex, but spot the cool part:
```asm
.L4:
        ; Reads 8 integers into ymm1, y[i..i+7]
        vmovdqu    ymm1, YMMWORD PTR [rsi+rax]
        ; Compares that with 8 integers x[i..i+7]
        vpcmpgtd   ymm0, ymm1, YMMWORD PTR [rdi+rax]
        ; Were all 8 values less or equal?
        vptest     ymm0, ymm0
        ; if so, skip the write
        je         .L3
        ; mask move x[i..i+7] with the y values that were greater
        vpmaskmovd YMMWORD PTR [rdi+rax], ymm0, ymm1
.L3:
        add        rax, 32        ; move forward 32 bytes
        cmp        rax, 262144    ; are we at the end? (4*65536)
        jne        .L4            ; keep looping if not
```
With a very similar number of instructions we're now processing 8 integers at a time! This is the power of SIMD - we'd likely be limited by the bandwidth to memory rather than the housekeeping of working out what to do with them.
There are a couple of things to talk about though. Firstly, what's all this "mask move" stuff? Well, with SIMD we can act on "multiple data" but only with a "single instruction". Our original code has a conditional update: not every array element is necessarily processed in the same way. The compiler has used a "mask move" to let it conditionally write back only the elements that were "greater than", which is pretty clever of it. We can help it if we don't mind unconditionally writing to memory:
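(Again a sketch rather than the original listing: the same function, with the conditional update replaced by a ternary that always stores.)

```cpp
// As before, but we now write x[i] back unconditionally.
void updateMax(std::span<int, Size> x, std::span<const int, Size> y) {
    for (std::size_t i = 0; i < Size; ++i) {
        x[i] = x[i] > y[i] ? x[i] : y[i];
    }
}
```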
By unconditionally writing to x[i] we can reduce the loop to:⁷
```asm
.L3:
        ; Read 8 integers from x[i..i+7]
        vmovdqu  ymm0, YMMWORD PTR [rdi+rax]
        ; "max" the integers with y[i..i+7]
        vpmaxsd  ymm0, ymm0, YMMWORD PTR [rsi+rax]
        ; Store them back
        vmovdqu  YMMWORD PTR [rdi+rax], ymm0
        add      rax, 32        ; move forward 32 bytes
        cmp      rax, 262144    ; are we at the end?
        jne      .L3            ; keep looping if not
```
The compiler has spotted that our ternary operator is exactly the same as the built-in vector "max" instruction! No explicit comparisons or branches now - fantastic.
There's one last thing to cover - the eagle-eyed amongst you might have spotted that there are actually two loops: one using vector instructions, and one doing regular one-at-a-time operations. The regular loop is jumped to conditionally after this check:
```asm
        lea     rax, [rdi-4]    ; rax = x - 4
        sub     rax, rsi        ; rax = x - 4 - y
        cmp     rax, 24         ; compare rax with 24
        mov     eax, 0          ; (set up loop counter for below)
        jbe     .L2             ; if rax <= 24, jump to the slow case
        ; falls through to the vectorised loop here
```
So what's going on here? We're checking x and y - not the values of the two arrays, but the addresses. That somewhat awkward sequence is essentially saying "do the two arrays overlap with each other in a way that would prevent us being able to vectorise?". If x starts a little way ahead of y - within 28 bytes - then we can't read and write in 8-int (32-byte) chunks: working in batches like this would give different results, as the overlap means writing to x[i] might affect y[i+2] (or similar). So, the compiler has to add this check and generate a fall-back case that does one value at a time.
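(A hypothetical illustration, reusing the sketch function from above, of the kind of overlap this check guards against:)

```cpp
// Hypothetical: x and y view overlapping storage, with x starting
// two ints (8 bytes) after y.
void demonstrateOverlap() {
    static int buffer[Size + 2]{};
    std::span<int, Size> x(buffer + 2, Size);   // x[i] is buffer[i + 2]
    std::span<const int, Size> y(buffer, Size); // y[i] is buffer[i]
    // Writing x[i] updates the storage y[i + 2] reads, so processing
    // eight elements per iteration would give different results from
    // processing one at a time.
    updateMax(x, y);
}
```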
With some non-standard trickery⁸ we can tell the compiler to ignore vector dependencies for the loop:
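(The exact trickery isn't reproduced here; GCC's ivdep pragma - literally "ignore vector dependencies" - fits the description, so here's a sketch using it.)

```cpp
// Non-standard: promises GCC the loop has no loop-carried
// dependencies, so it can drop the runtime overlap check.
void updateMax(std::span<int, Size> x, std::span<const int, Size> y) {
#pragma GCC ivdep
    for (std::size_t i = 0; i < Size; ++i) {
        x[i] = x[i] > y[i] ? x[i] : y[i];
    }
}
```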
That's it for today: with the right flags and a little care about data layout, the compiler can turn your one-at-a-time loops into batch-processing powerhouses! Tomorrow we'll look at floating point vectorisation and its own peculiar quirks.
See the video that accompanies this post.
This post is day 20 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.
← Chasing your tail | When SIMD Fails: Floating Point Associativity →
This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.
Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.
1. I spent some of the late 90s and early 2000s writing video games, whose main rendering loops need to do a lot of matrix maths on a lot of 3d points. Writing big chunks of SIMD-like assembly code was a big part of that. You can see some of my code from that era in the Red Dog source.
2. I mean, it's also called "vectorisation" for a reason, eh? This has great benefits for caching too: a long sequence is best-case for the automatic prefetchers in modern CPUs.
3. See Mike Acton's seminal video on Data Oriented Design.
4. In "real" code we'd make Size a template parameter, but without an instantiation we'd see no code. To keep things brief, I'm hardcoding it to a constant here.
5. Or, more targetedly, use -ftree-loop-vectorize.
6. Almost every CPU made in the last decade supports it, but the default target of GCC doesn't have the masked instructions the compiler wants to use.
7. This might be OK if we think we'll need to update many of the x array elements anyway. If we expect not to update much of x then this could be a pessimisation, as we unconditionally write to the whole array now.
8. Ideally we'd have a way in standard C++ to say "these don't overlap". Try as I might, without making things a lot more complex, neither __restrict on pointers derived from the span nor [[assume(...)]] could convince the compiler that it didn't need to be conservative. Splitting the function into two, one taking __restrict__ pointers and the other passing x.data(), did work but was horrible.