RSS Bilingual Reader

Written by me, proof-read by an LLM. Details at end.

Which loop style is "best"? This very question led to the creation of Compiler Explorer! In 2011 I was arguing with my team about whether we could switch all our loops from ordinal or iterator-style to the "new" range-for. I wrote a small to iteratively show the compiler output as I edited code in , and the seed of was born.¹shell script Compiler Explorervim

C and C++ have many ways to phrase loops: , , , range-for (), STL algorithms like , and now range transformations! Let's see if loop style actually matters for performance.for()while()do..while()for (x : container)std::for_each

Let's look at similar functions that calculate the sum of a , using different looping strategies. First, let's use the ordinal-based approach:std::vector<int>

To explain the generated assembly code, we need to take a look at the innards of a . Internally, the vector holds three pointers: one to the beginning of the allocated data ("start"), one to the just past the end of the used data ("finish"), and a last one to the end of the storage (the space which we can grow into without reallocating). It does not explicitly store the size. Let's look at the first few instructions:std::vector²

The first thing we do is get the start and finish pointers of the vector, and then subtract them to work out the size (returning early if it's zero). So far so good! (The .L4 label just returns 0, so I'll ignore it in this description).

Next we see the loop itself:

Here we see the slightly bonkers x86 addressing mode we touched on in the which lets us read a value from memory calculated from the start plus the index (times four, as each is four bytes), add it in a single instruction. CISC, am I right?earlier lea postintand

Anyway, everything seems ok in this loop, right? Except, it bothers me that we calculate the size at all. I mean, we don't care what size it is, we just want to iterate over every element, right?really³

How about we rephrase this to use the pointers directly? We can fish out the start pointer with , and calculate the end pointer by adding the size (and hope the optimiser spots it doesn't need to use the size, it can just use the directly). What does that look like?.data()endactually

That's better! The optimiser avoided all the shifts and size calculation in the setup code, and the inner loop is simply reading and walking a pointer forward.muchhas

Any time you see C++ code with naked pointers in it, you should ask yourself if there's a better way, though. So, let's see what would happen if we used the fancy range-for:

the same code as our -based, pointer approach! The compiler canonicalised the range-for into identical assembly. Less error-prone pointer manipulation, more intention-revealing C++ code, the best code generation!Exactlywhileand

Even with a standard algorithm, the pattern continues:

Identical again! Canonicalisation rewrites all these different loop forms into the same basic setup, generating optimal code every time. Whether you write explicit loops or use standard algorithms, the compiler sees through to the underlying iteration pattern. Write clear, intention-revealing code - the optimiser has your back.⁴

See that accompanies this post.the video

This post is day 8 of , a 25-day series exploring how compilers transform our code.Advent of Compiler Optimisations 2025

← | →Multiplying our way out of division Induction variables and loops

This post was written by a human () and reviewed and proof-read by LLMs and humans.Matt Godbolt

.Support Compiler Explorer on or , or by buying CE products in the Patreon GitHub Compiler Explorer Shop

movrsi,QWORDPTR[rdi]; rsi = vec->startmovrcx,QWORDPTR[rdi+8]; rcx = vec->finishsubrcx,rsi; rcx = finish - startje.L4; if zero; early returnsarrcx,2; rcx >>= 2 (divide by sizeof(int)); rcx = vec.size()xoreax,eax; eax = 0 (this will be "index")xoredx,edx; edx = 0 (this will be "sum")

.L3:addedx,DWORDPTR[rsi+rax*4]; edx += *(int*)(start + index*4)addrax,1; ++indexcmprax,rcx; if index == size?jb.L3; if not, loopmoveax,edx; return "sum"ret

That might seem odd - why would we expect it to be slower? In this instance we had just been bitten by a similar issue in Java - a range-for equivalent in Java creates a temporary iteration object each time around, and we were trying to write "garbage free" java. This caught us out, so the team was a bit sensitive to this kind of issue. ↩
At least, the GNU implementation. As best I know all standard library implementations follow this pattern, but I don't think the C++ standard requires them to work exactly this way. ↩
Honestly it also annoys me the register allocator didn't put into from the start to save on the at the end, but that's minor. sumraxmov eax, edx↩
With the exception of the ordinal-based one where the compiler doesn't work out it can drop the loop index. ↩

Going loopy