Multiplying our way out of division
Written by me, proof-read by an LLM.
Details at end.
I occasionally give presentations to undergraduates, and one of my favourites is taking the students on a journey of optimising a "binary to decimal" routine. There are a number of tricks, which I won't go in to here, but the opening question I have is "how do you even turn a number into its ASCII representation?"1
If you've never stopped to think about it, take a moment now to do so, it can be a fun problem.
The simple approach is to use to get the rightmost digit (adding 48 to turn it into the ASCII number), then divide by ten, and keep going until the number is zero. This produces the digits backwards, but you can reverse them afterwards, which I won't show here. This routine is one of the few legitimate uses of in C, as we always want to emit at least one digit even if the number zero to start with. The code looks something like:23number % 10do whileis
Here the compiler does a fantastic job. we saw how division by powers of two can be optimised to shifts; today we'll see the compiler manages to avoid expensive division even when we dividing by a power of two. It also gets the remainder cheaply, which we usually get for free from the divide instruction.Yesterdayaren't
The transformation is quite clever - let's walk through the annotated assembly:
There's a lot to unpack here, several different optimisations, but the main one is how the compiler has turned division by a constant ten into a multiply and a shift. There's a magic constant and a shift right of 35! Shifting right by 35 is the same as dividing by 2 - what's going on?0xcccccccd354
Let's see what happens each step of the algorithm:
What's happening is that is very close to ⅒ (around 0.10000000000582077). By multiplying our input value by this constant first, then shifting right, we're doing fixed-point multiplication by ⅒ - which is division by ten. The compiler knows that for all possible unsigned integer values, this trick will always give the right answer. For other values and signednesses, sometimes it needs to account for rounding, for example, dividing a signed value by three:0xcccccccd / 2**35
Here we see that it has to account for rounding. If you edit the code above and try dividing by fifteen, you'll see that causes even more code to be emitted. However, it's all still faster than a real divide instruction.
Back to our ASCII conversion example, to get the remainder (the modulus); the compiler takes the (truncated) , multiplies it back up by 10 using tricks (we've covered this ), and then the difference between the original number and this computed value is the remainder.number / 10leabefore
The rest of the optimisations are the compiler trying to do work eagerly (like incrementing ), and checking one loop iteration ahead: there's no point looping if the current number is less than or equal to 9.buf
Overall, some very clever optimisations that avoid division entirely!
See that accompanies this post.the video
This post is day 7 of , a 25-day series exploring how compilers transform our code.Advent of Compiler Optimisations 2025
← | →DivisionGoing loopy
This post was written by a human () and reviewed and proof-read by LLMs and humans.Matt Godbolt
.Support Compiler Explorer on or , or by buying CE products in the PatreonGitHubCompiler Explorer Shop
to_decimal_backwards(unsignedint,char*):movrax,rsi; rax = bufmovesi,3435973837; esi = 0xcccccccd.L2:movedx,edi; edx = numbermovecx,edi; ecx = numberaddrax,1; ++bufimulrdx,rsi; rdx *= 0xcccccccdshrrdx,35; rdx = rdx >> 35; rdx = number / 10 [see below]lear8d,[rdx+rdx*4]; r8 = rdx * 5addr8d,r8d; r8 = rdx * 5 * 2 = rdx * 10; r8 = (number / 10) * 10 [rounded down]subecx,r8d; ecx = number - (number / 10) * 10; ecx = number % 10addecx,48; ecx = '0' + (number % 10)cmpedi,9; number > 9?movedi,edx; number = number / 10movBYTEPTR[rax-1],cl; *(buf-1) = '0' + (number % 10)ja.L2; loop if number (prior to divide) >= 10ret>>>1234*0xcccccccd4239991714858>>>4239991714858//(2**35)123>>>123*101230>>>1234-12304A decade ago many trading systems still used to accept new orders, and that's essentially ASCII. Taking prices and quantities and quickly turning them into ASCII used to be important! FIX↩
Yes I know I said I wouldn't go into it, but the faster implementation I wrote uses bit trickery to get log(10) of the number, then dispatches via template trickery to a "N-digit" routine that then uses a 100-entry lookup table to look up pairs of digits. Fun stuff. ↩
If you can get the log(10) you can get the number of digits in the output, and then you can start at the right spot in the output buffer and write the number backwards. ↩
You might wonder why the compiler uses (signed multiply) even though we're working with unsigned values. The low 64 bits of the result are identical for both signed and unsigned multiplication, and has more flexible operand forms that let us specify three operands.
imulimul↩