RSS Bilingual Reader

Written by me, proof-read by an LLM. Details at end.

Look back at - there's an optimisation I completely glossed over. Let me show you what I mean:our simple loop example

On every loop iteration we are calling to compare the index value, and to check if the index has reached the end of the vector. However, looking in the assembly, the compiler has pulled the size calculation out of the loop entirely - the that divides the "end - start" value by the size of an int is only performed the loop! The compiler quietly rewrote our code to be:vec.size()sar¹before

Such a transformation is called Loop-Invariant Code Motion, or LICM. It's not just expressions in the clauses themselves, any code inside the loop that the compiler can prove doesn't depend on which iteration it's in is fair game.for

Let's give our loop something more to do: We'll take a and count how many characters fall within a given range (being "capital letters" or "numerics"). We'll write it naively, using a function that returns the min and max character "in range" as a pair:std::string_viewget_range()²³

You can see that clang here has realised the range from cannot change during the loop and so has moved it outside:⁴get_range

Clang has also played some other neat tricks here to avoid a branch inside the loop using and to get a 0 or 1 based on a condition code.⁵setlesetge

Usually I end with a "trust the compiler" type sentiment. However, in this case I was surprised that gcc wasn't able to perform code motion here. I guess that's what is for: Trust the compiler, but know how to verify its output too.⁶Compiler Explorer

See that accompanies this post.the video

This post is day 13 of , a 25-day series exploring how compilers transform our code.Advent of Compiler Optimisations 2025

← | →Unswitching loops for fun and profit When LICM fails us

This post was written by a human () and reviewed and proof-read by LLMs and humans.Matt Godbolt

.Support Compiler Explorer on or , or by buying CE products in the Patreon GitHub Compiler Explorer Shop

count_in_range:; ...preamble removed for brevity...callget_range(RangeType)@PLT; call get_range (OUTSIDE OF LOOP)movsxecx,al; ecx = range.firstshreax,8movsxedx,al; edx = range.secondxoresi,esi; esi = loop counter = 0xoreax,eax; eax = num = 0.LBB0_4:movsxedi,byteptr[rbx+rsi]; read next ccmpecx,edi; compare with firstsetler8b; r8b = c >= firstcmpedx,edi; compare with secondsetgedil; dil = c <= secondanddil,r8b; dil = dil & r8b; ie 1 if c in range, else 0movzxedi,dil; edi = (zero extended) diladdrax,rdi; num += (1 if in range, else 0)incrsi; increment loop countercmpr14,rsi; are we done?jne.LBB0_4; loop if not; ...postamble/return removed for brevity...

Recall that the vector holds various pointers, and to determine the count of elements, we need to subtract the from the and divide by the element size. endstart↩
A is, in this case, equivalent to . std::string_viewstd::span<char>↩
I haven't actually supplied the body of the function here - it would add more code would show the compiler doing inlining, which I'll cover properly later in this series. However, I do need to use a magic attribute here to let the compiler know that the implementation of this function is "pure" and depends only on its inputs. The compiler usually figures this out for itself from the implementation if it can see it. get_range[[gnu::pure]]and↩
Surprisingly, gcc seems unable to take advantage of this optimisation and in fact calls each iteration, even if I use the even-stricter attribute. get_range[[gnu::const]]twice↩
I've deliberately used only here to minimise some of the tricks clang will do here. I will cover some of those later in this series. -O1even more clever↩
After speaking with some other C++ experts, I filed a on a similar but largely equivalent example. A gcc maintainer suggests it's due to the structure type returned (a here) doesn't go through common subexpression elimination (CSE), which prevents the analysis needed to do LICM. gcc bug ↩std::pair⁷
We won't cover CSE in this series, but the compiler tries to detect redundant calculation of the same subexpressions, and only does the work once where possible. ↩