Pop goes the...population count?
Written by me, proof-read by an LLM.
Details at end.
Who among us hasn't looked at a number and wondered, "How many one bits are in there?" No? Just me then?
Actually, this "population count" operation can be pretty useful in some cases like data compression algorithms, , and . How might one write some simple C to return the number of one bits in an unsigned 64 bit value?cryptography, chess, error correctionsparse matrix representations
One way might be to loop 64 times, checking each bit and adding one if set. Or, equivalently, shifting that bit down and adding it to a running count: sometimes the population count operation is referred to as a "horizontal add" as you're adding all the 64 bits of the value together, horizontally. There are "divide and conquer" approaches too, see the amazing for a big list.Stanford Bit Twiddling Hacks page
My favourite way is to loop while the value is non-zero, and use a cute trick to "clear the bottom set bit". The loop count is then the number of set bits. How do you clear the bottom set bit? You a value with itself decremented!and
If you try some examples on paper, you'll see that subtracting one always moves the bottom set bit down by one place, setting all the bits from there down. Everything else is left the same. Then when you , the bottom set bit is guaranteed to be -ed with zero, but everything else remains. Great stuff!andand
All right, let's see what the compiler makes of this:
The core loop is pretty much what we'd expect, using the to get , ing and counting:lea trickvalue - 1and
Great stuff, but we can do better. By default gcc and clang both target some kind of "generic" processor which influences which instructions they can use. We're compiling for Intel here, and gcc's default is somewhere around Intel's "nocona" architecture, from 2004. Unless you are running vintage hardware you can probably change it to something better. Let's pick the super up-to-date "westmere" (from 2010...) using and see what happens:-march=westmere1
Wow! The entire routine has been replaced with a - . When I first saw this optimisation I was blown away: the compiler recognises a relatively complex loop as being functionally equivalent to a single instruction. Both gcc and clang can do this, and within Compiler Explorer you can use the optimisation pipeline viewer in clang to see that clang's "loop deletion pass" is responsible for this trick:single instructionpopcnt rax, rdi3
Compiler Explorer's
Opt Pipeline View
Compilers canonicalise code too, so some similar population count code will also be turned into a single instruction, though sadly not all. In this case, it's probably better to actually use a standard C++ routine to guarantee the right instruction as well as reveal your intention: . But even if you don't, the compiler might just blow your mind with a single instruction anyway.std::popcount
See that accompanies this post.the video
This post is day 11 of , a 25-day series exploring how compilers transform our code.Advent of Compiler Optimisations 2025
← | →Unrolling loopsUnswitching loops for fun and profit
This post was written by a human () and reviewed and proof-read by LLMs and humans.Matt Godbolt
.Support Compiler Explorer on or , or by buying CE products in the PatreonGitHubCompiler Explorer Shop
value : 11010100
subtract 1 : 11010011
& value : 11010000
.L3:learax,[rdi-1]; rax = value - 1addedx,1; ++resultandrdi,rax; value &= value - 1jne.L3; ...while (value)Interestingly, if you select other architectures like you'll see the compiler emits a seemingly redundant the . This is to work around a performance bug in the CPU!
-march=sandybridgexor eax, eaxpopcnt eax, edibefore2↩The register renamer / dependency tracking in the out of order engine incorrectly believes that has a dependency on the previous value of the destination register. The breaks this dependency.
popcntxor eax, eax↩While this post is using Intel, ARM has , and some RISC-V variants have , which do the same thing.
popcountcpop↩