Unrolling loops
Written by me, proof-read by an LLM.
Details at end.
A common theme for helping the compiler optimise is to give it as much information as possible. Using the right signedness of types, targeting the right CPU model, keeping loop iterations independent, and for today's topic: telling it how many loop iterations there are going to be ahead of time.
Taking the range-based sum example from our earlier post on loops, but using a std::span, we can explore this ability. Let's take a look at what happens if we use a dynamically-sized span - we'd expect it to look very similar to the std::vector version:[1]
The compiler doesn't know how many ints there will be ahead of time, and it generates pretty straightforward code:[2][3]
A simple modification to the code to pass a std::span<int, 8>, so the compiler now knows it will always loop eight times:
The compiler takes advantage of this knowledge by unrolling the loop - it compiles the code as if all eight iterations of the loop had been written out one after another, avoiding the loop counter and the conditional branch. That saves two instructions per loop iteration (the subs and the bne), and also allows the compiler to spot more patterns: we see that it loads the first two values at once using the ldmib "load multiple" instruction.[4]
Try changing the 8 in the example above to other values and you'll see the variety of different ways the compiler chooses to implement the loop. At 32 iterations it gets quite register-happy, even spilling onto the stack briefly to get more registers to load. At 50 iterations the optimiser gives up and falls back to regular looping.[5]
Compilers will sometimes partially unroll loops (chunking up unrolled sections in a larger loop), or even speculatively unroll when the count isn't known ahead of time. There's a ton of heuristics at play, and honestly, the compiler's guess is usually pretty good. But it's worth checking your hot loops to see what it's doing - and if you can give it the loop count at compile time, you're setting it up for success.[6]
See the video that accompanies this post.
This post is day 10 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.
← Induction variables and loops | Pop goes the...population count? →
This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.
Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.
The inner loop of the dynamically-sized version:

```
.LBB0_2:
        ldr   r3, [r2], #4   ; value = *ints++
        subs  r1, r1, #4     ; count down remaining bytes
        add   r0, r3, r0     ; sum = value + sum
        bne   .LBB0_2        ; loop while bytes remain
```

1. A span is a "pointer and a length", representing a contiguous array of values. The length can optionally be compile-time known, so `std::span<int>` is dynamically sized, but `std::span<int, 8>` is a span of 8 integers.
2. I'm using this older ARM on purpose: it doesn't have fancy-pants vector instructions, which I'll cover later. We can see the loop optimisation on this simple example without introducing lots of complexity. We'll get there, I promise!
3. Here the compiler has done something unusual - it multiplies up the size by 4 (`lsl r1, r1, #2` in the preamble), and then counts down in fours. I don't know why it doesn't realise it can avoid the shift and just count down in ones.
4. While it could load even more registers in one go, the compiler has made the tradeoff of getting two reads at once before starting a pattern of loading and adding, to take advantage of the instruction-level parallelism this unlocks.
5. I'm a little surprised/disappointed that it doesn't instead "chunk up" the loop into fixed-size blocks of, say, 16, and then loop over those three times to get 48, then add the last two. We'll see ways the compiler might choose to "chunk" later in the series with auto-vectorisation.
6. Things like profile-guided optimisation (PGO) can help the compiler check its guesses are on track: you build your program with instrumentation, run it with representative data, then feed the instrumentation output back to the compiler. Some of that data will include loop counts, which can help the optimiser. I won't be covering PGO in this series, but it's worth a look.