Pop goes the...population count?

xania.org2025年12月11日 12:00

Written by me, proof-read by an LLM. Details at end.

Who among us hasn't looked at a number and wondered, "How many one bits are in there?" No? Just me then?

Actually, this "population count" operation can be pretty useful in some cases like data compression algorithms, , and . How might one write some simple C to return the number of one bits in an unsigned 64 bit value?cryptography, chess, error correctionsparse matrix representations

One way might be to loop 64 times, checking each bit and adding one if set. Or, equivalently, shifting that bit down and adding it to a running count: sometimes the population count operation is referred to as a "horizontal add" as you're adding all the 64 bits of the value together, horizontally. There are "divide and conquer" approaches too, see the amazing for a big list.Stanford Bit Twiddling Hacks page

My favourite way is to loop while the value is non-zero, and use a cute trick to "clear the bottom set bit". The loop count is then the number of set bits. How do you clear the bottom set bit? You a value with itself decremented!and

If you try some examples on paper, you'll see that subtracting one always moves the bottom set bit down by one place, setting all the bits from there down. Everything else is left the same. Then when you , the bottom set bit is guaranteed to be -ed with zero, but everything else remains. Great stuff!andand

All right, let's see what the compiler makes of this:

The core loop is pretty much what we'd expect, using the to get , ing and counting:lea trickvalue - 1and

Great stuff, but we can do better. By default gcc and clang both target some kind of "generic" processor which influences which instructions they can use. We're compiling for Intel here, and gcc's default is somewhere around Intel's "nocona" architecture, from 2004. Unless you are running vintage hardware you can probably change it to something better. Let's pick the super up-to-date "westmere" (from 2010...) using and see what happens:-march=westmere1

Wow! The entire routine has been replaced with a - . When I first saw this optimisation I was blown away: the compiler recognises a relatively complex loop as being functionally equivalent to a single instruction. Both gcc and clang can do this, and within Compiler Explorer you can use the optimisation pipeline viewer in clang to see that clang's "loop deletion pass" is responsible for this trick:single instructionpopcnt rax, rdi3

Compiler Explorer's Screenshot of CE showing the opt pipeline viewer with the loop being replaced with a call to @llvm.ctpop.i64Opt Pipeline View

Compilers canonicalise code too, so some similar population count code will also be turned into a single instruction, though sadly not all. In this case, it's probably better to actually use a standard C++ routine to guarantee the right instruction as well as reveal your intention: . But even if you don't, the compiler might just blow your mind with a single instruction anyway.std::popcount

See that accompanies this post.the video

This post is day 11 of , a 25-day series exploring how compilers transform our code.Advent of Compiler Optimisations 2025

← | →Unrolling loopsUnswitching loops for fun and profit

This post was written by a human () and reviewed and proof-read by LLMs and humans.Matt Godbolt

.Support Compiler Explorer on or , or by buying CE products in the PatreonGitHubCompiler Explorer Shop

value       : 11010100
subtract 1  : 11010011
& value     : 11010000
.L3:learax,[rdi-1]; rax = value - 1addedx,1; ++resultandrdi,rax; value &= value - 1jne.L3; ...while (value)

  1. Interestingly, if you select other architectures like you'll see the compiler emits a seemingly redundant the . This is to work around a performance bug in the CPU! -march=sandybridgexor eax, eaxpopcnt eax, edibefore2

  2. The register renamer / dependency tracking in the out of order engine incorrectly believes that has a dependency on the previous value of the destination register. The breaks this dependency. popcntxor eax, eax

  3. While this post is using Intel, ARM has , and some RISC-V variants have , which do the same thing. popcountcpop