xania.org
2025 in Review
Written by me, proof-read by an LLM. Details at end. 2025 has been quite a year for me. The big-ticket items were spending the majority of the year on a non-compete, a new job, and some videos and conference talks. It was a bumper year for my public talks, and I also appeared in a number of Computerphile videos. On the Compiler Explorer front, I finally solved a three-year-old problem with our new content-addressable filesystem that mounts compiler images on demand instead of all 2,000+ at bo...
Thank you
Written by me, proof-read by an LLM. Details at end. It's the 25th! Whatever you celebrate this time of year, I wish you the very best and hope you are having a lovely day. For me, this is a family time: I'm not at all religious but was brought up to celebrate Christmas. So, today we'll be cooking a massive roast dinner and enjoying family time. This series was an idea I had around this time last year, and it has been a substantial amount of work. I've really enjoyed writing it, and seeing the...
When compilers surprise you
Written by me, proof-read by an LLM. Details at end. Every now and then a compiler will surprise me with a really smart trick. When I first saw this optimisation I could hardly believe it. I was looking at loop optimisation, and wrote something like this simple function that sums all the numbers up to a given value: So far so decent: GCC has done some preliminary checks, then fallen into a loop that efficiently sums numbers. But taking a closer look at the loop we see something un...
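A minimal sketch of the kind of summing function the post describes (the name `sum_to` is my own); compilers such as GCC can replace the whole loop with the closed form n*(n+1)/2:

```cpp
// Sums 1..n with a plain loop; optimising compilers recognise this
// pattern and emit the closed-form n*(n+1)/2 instead of looping.
unsigned sum_to(unsigned n) {
    unsigned total = 0;
    for (unsigned i = 1; i <= n; ++i)
        total += i;
    return total;
}
```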
Switching it up a bit
Written by me, proof-read by an LLM. Details at end. The standard wisdom is that switch statements compile to jump tables. And they do - when the compiler can't find something cleverer to do instead. Let's start with a really simple example: Here the compiler has spotted the relationship between x and the return value, and rewritten the code as if (x < 5) return (x+1) * 100; else return 0; - pretty neat. No jump table, just maths! If we mix up the code a bit so there's no obvious relationshi...
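A sketch of the sort of switch that invites this rewrite (function name and values are my own, chosen so the arithmetic relationship holds):

```cpp
// Each case returns (x+1)*100; the compiler can spot this and replace
// the switch with a range check plus a multiply - no jump table needed.
int lookup(int x) {
    switch (x) {
        case 0: return 100;
        case 1: return 200;
        case 2: return 300;
        case 3: return 400;
        case 4: return 500;
        default: return 0;
    }
}
```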
Clever memory tricks
Written by me, proof-read by an LLM. Details at end. After exploring SIMD vectorisation over the last few posts, let's shift gears to look at another class of compiler cleverness: memory access patterns. String comparisons seem straightforward enough - check the length, compare the bytes, done. But watch what Clang does when comparing against compile-time constants, and you'll see some rather clever tricks involving overlapping memory reads and bitwise operations. What looks like it should be a call t...
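To make the trick concrete, here's a hand-written sketch of the overlapping-read idea (the function name and the "hello" constant are my own): a five-byte comparison done as two overlapping four-byte loads, XORed against the expected words and ORed together.

```cpp
#include <cstring>
#include <cstdint>

// Compares a 5-byte string against "hello" using two overlapping
// 4-byte reads (bytes 0..3 and 1..4) instead of a byte-by-byte loop.
bool is_hello(const char* s, std::size_t len) {
    if (len != 5) return false;
    std::uint32_t a, b, ea, eb;
    std::memcpy(&a, s, 4);        // bytes 0..3
    std::memcpy(&b, s + 1, 4);    // bytes 1..4 - overlaps the first read
    std::memcpy(&ea, "hell", 4);
    std::memcpy(&eb, "ello", 4);
    // If either word differs, the XOR is non-zero; OR combines both checks.
    return ((a ^ ea) | (b ^ eb)) == 0;
}
```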
When SIMD Fails: Floating Point Associativity
Written by me, proof-read by an LLM. Details at end. Yesterday we saw SIMD work beautifully with integers. But floating point has a surprise in store. Let's try summing an array: Looking at the core loop, the compiler has pulled off a clever trick: The compiler is using a vectorised add instruction which treats the register as 8 separate integers, adding them individually to the corresponding elements read from memory. That's incredibly efficient, processing 8 integers per loop iteration, but at the end o...
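The reason floats resist this treatment is that floating-point addition is not associative, so reordering the sum (as vectorisation must) can change the answer. A small self-contained demonstration (values are my own choice):

```cpp
// Shows (a + b) + c != a + (b + c) for floats: adding 1.0f to -1e8f
// is lost to rounding, but adding it after the big values cancel is not.
bool float_add_is_associative() {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left = (a + b) + c;   // 0.0f + 1.0f == 1.0f
    float right = a + (b + c);  // b + c rounds back to -1e8f, so 0.0f
    return left == right;
}
```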
SIMD City: Auto-vectorisation
Written by me, proof-read by an LLM. Details at end. It's time to look at one of the most sophisticated optimisations compilers can do: autovectorisation. Most "big data" style problems boil down to "do this maths to huge arrays", and the limiting factor isn't the maths itself, but the feeding of instructions to the CPU, along with the data it needs. To help with this problem, CPU designers came up with SIMD: "Single Instruction, Multiple Data". One instruction tells the CPU what to do with a wh...
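A minimal sketch of the kind of "do this maths to huge arrays" loop that auto-vectorises well (names are my own): independent element-wise work with no cross-iteration dependencies.

```cpp
#include <cstddef>

// Element-wise addition: each iteration is independent, so the compiler
// can process a whole register-full of elements per instruction.
void add_arrays(const int* a, const int* b, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```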
Chasing your tail
Written by me, proof-read by an LLM. Details at end. Inlining is fantastic, as we've seen recently. There's a place it surely can't help though: recursion! If we call our own function, then surely we can't inline... Let's see what the compiler does with the classic recursive "greatest common divisor" routine - surely it can't avoid calling itself? And yet: The compiler is able to avoid the recursion! When a function ends by calling another function (including itself), the compiler is often able...
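The classic recursive GCD looks like this; because the recursive call is the last thing the function does (a tail call), compilers routinely turn it into a plain loop:

```cpp
// Euclid's algorithm. The self-call is in tail position, so the
// compiler can replace the call with a jump back to the top.
unsigned gcd(unsigned a, unsigned b) {
    if (b == 0) return a;
    return gcd(b, a % b);  // tail call - becomes a loop after optimisation
}
```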
Partial inlining
Written by me, proof-read by an LLM. Details at end. We've learned how important inlining is to optimisation, but also that it might sometimes cause code bloat. Inlining doesn't have to be all-or-nothing! Let's look at a simple function that has a fast path and a slow path, and then see how the compiler handles it. In this example we have some function that has a really trivial fast case for numbers in the range 0-100. For other numbers it does something more expensive. Then we call it twice (making ...
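A hedged sketch of the shape being described (function names, the fast-path work, and the "expensive" part are all my own stand-ins): the compiler can inline just the cheap range check at each call site while keeping the expensive tail as an out-of-line call.

```cpp
#include <cmath>

// Stand-in for the expensive case; kept out of line.
int slow_path(int x) {
    return static_cast<int>(std::sqrt(static_cast<double>(x < 0 ? -x : x))) + 1000;
}

// Trivial fast case for 0..100; candidates for partial inlining split
// the function here, inlining only the cheap test and fast path.
int process(int x) {
    if (x >= 0 && x <= 100) return x * 2;  // fast path
    return slow_path(x);                   // slow path stays a call
}
```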
Inlining - the ultimate optimisation
Written by me, proof-read by an LLM. Details at end. Sixteen days in, and I've been dancing around what many consider the fundamental compiler optimisation: inlining. Not because it's complicated - quite the opposite! - but because inlining is less interesting for what it does (copy-paste code), and more interesting for what it enables. Initially inlining was all about avoiding the expense of the call itself, but nowadays inlining enables many other optimisations to shine. We've already encoun...
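A tiny example of inlining as an enabler (names are my own): once `square` is inlined into `answer`, constant propagation folds the whole thing down to returning 49.

```cpp
// After inlining, the compiler sees `return 7 * 7;` and folds it to 49 -
// the optimisation comes from what inlining *enables*, not the saved call.
static int square(int x) { return x * x; }

int answer() { return square(7); }
```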
Calling all arguments
Written by me, proof-read by an LLM. Details at end. Today we're looking at calling conventions - which aren't purely optimisation related but are important to understand. The calling convention is part of the ABI (Application Binary Interface), and varies from architecture to architecture and even OS to OS. Today I'll concentrate on the System V ABI for x86 on Linux, as (to me) it's the most sane ABI. Before we go on: I'd be remiss if I didn't point out that I can remember which register has ...
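As a concrete anchor (assuming the 64-bit System V ABI, which is what Linux on x86-64 uses): the first two integer arguments arrive in rdi and rsi, and the integer result goes back in rax.

```cpp
// Under the System V x86-64 calling convention, `a` arrives in rdi,
// `b` in rsi, and the return value leaves in rax - so this whole
// function can compile to something like: lea rax, [rdi + rsi]; ret
long add(long a, long b) {
    return a + b;
}
```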
Aliasing
Written by me, proof-read by an LLM. Details at end. Yesterday we ended on a bit of a downer: aliasing stopped optimisations dead in their tracks. I know this is supposed to be the Advent of Compiler Optimisations, not the Advent of Compiler Giving Up! Knowing why your compiler can't optimise is just as important as knowing all the clever tricks it can pull off. Let's take a simple example of a counter class. It accumulates integers into a member variable. I've used C++ templates to show two ...
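A hedged sketch of such a counter (the class and member names are my own): because the input pointer might alias the member, the compiler must conservatively re-read and re-write `total` every iteration instead of keeping it in a register.

```cpp
#include <cstddef>

// Accumulates integers into a member. If `values` could point at
// `total` itself, the compiler can't cache `total` in a register.
struct Counter {
    int total = 0;
    void add_all(const int* values, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            total += values[i];  // reload/store forced by possible aliasing
    }
};
```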
When LICM fails us
Written by me, proof-read by an LLM. Details at end. Yesterday's LICM post ended with the compiler pulling invariants like size() and get_range out of our loop - clean assembly, great performance. Job done, right? Not quite. Let's see how that optimisation can disappear. Let's say you had a const char *, and wanted to write a function to return whether there was an exclamation mark or not: Here we're relying on loop-invariant code motion (LICM) to move the length calculation out of the loop, called once before we lo...
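A hedged reconstruction of the sort of function in question (the name and use of `strlen` are my own assumption): the length call sits in the loop condition, and we're trusting LICM to hoist it.

```cpp
#include <cstring>

// strlen(s) is written inside the loop condition; LICM can hoist it
// only if the compiler can prove the string's contents never change
// during the loop - which writes through other pointers can spoil.
bool has_exclamation(const char* s) {
    for (std::size_t i = 0; i < std::strlen(s); ++i)
        if (s[i] == '!') return true;
    return false;
}
```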
Loop-Invariant Code Motion
Written by me, proof-read by an LLM. Details at end. Look back at our simple loop example - there's an optimisation I completely glossed over. Let me show you what I mean: On every loop iteration we compare the index value and check whether the index has reached the end of the vector. However, looking at the assembly, the compiler has pulled the size calculation out of the loop entirely - the division of the "end - start" value by the size of an int is performed only once, before the loop! The...
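For reference, a sketch of that loop shape: `v.size()` is a pointer subtraction and divide, but since it's loop-invariant the compiler computes it once up front.

```cpp
#include <vector>
#include <cstddef>

// v.size() appears in the condition every iteration, but it's invariant,
// so LICM hoists the (end - start) / sizeof(int) computation out of the loop.
int sum(const std::vector<int>& v) {
    int total = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i];
    return total;
}
```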
Unswitching loops for fun and profit
Written by me, proof-read by an LLM. Details at end. Sometimes the compiler decides the best way to optimise your loop is to... write it twice. Sounds counterintuitive? Let's change our sum example from before to optionally return a sum-of-squares: At -O2 the compiler turns the ternary sum += value * (squared ? value : 1); into a multiply-and-add (mla) instruction, conditionally picking either value or the constant 1 to avoid a branch inside the loop. H...
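A sketch of the unswitching candidate (signature is my own): `squared` never changes inside the loop, so the compiler can test it once and emit two specialised loops - a plain sum and a sum-of-squares.

```cpp
#include <cstddef>

// `squared` is loop-invariant: unswitching hoists the test out and
// duplicates the loop body, one copy per branch.
int sum(const int* values, std::size_t n, bool squared) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += values[i] * (squared ? values[i] : 1);
    return total;
}
```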
Pop goes the...population count?
Written by me, proof-read by an LLM. Details at end. Who among us hasn't looked at a number and wondered, "How many one bits are in there?" No? Just me then? Actually, this "population count" operation can be pretty useful in some cases, like data compression algorithms, cryptography, chess, error correction, and sparse matrix representations. How might one write some simple C to return the number of one bits in an unsigned 64-bit value? One way might be to loop 64 times, checking each bit and ...
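The naive 64-iteration version looks like this; modern compilers recognise popcount-style idioms, and many targets have a single instruction for it:

```cpp
#include <cstdint>

// Checks each of the 64 bits in turn - the straightforward version
// that a clever compiler can collapse into a hardware popcount.
int popcount(std::uint64_t value) {
    int count = 0;
    for (int i = 0; i < 64; ++i)
        if (value & (std::uint64_t{1} << i))
            ++count;
    return count;
}
```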
Unrolling loops
Written by me, proof-read by an LLM. Details at end. A common theme for helping the compiler optimise is to give it as much information as possible. Using the right signedness of types, targeting the right CPU model, keeping loop iterations independent, and, for today's topic: telling it how many loop iterations there are going to be ahead of time. Taking the range-based sum example from before, but using a span, we can explore this ability. Let's take a look at what happens if we use a dynamically-sized span...
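To show the "known trip count" case without any library machinery, here's a sketch using a fixed-size array reference (my own formulation): with exactly eight iterations known at compile time, the compiler can unroll the loop completely.

```cpp
// The trip count (8) is part of the type, so the compiler can fully
// unroll: eight adds, no loop counter, no branch.
int sum8(const int (&values)[8]) {
    int total = 0;
    for (int v : values)
        total += v;
    return total;
}
```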
Induction variables and loops
Written by me, proof-read by an LLM. Details at end. Loop optimisations often surprise us. What looks expensive might be fast, and what looks clever might be slow. Yesterday we saw how the compiler canonicalises loops so it (usually) doesn't matter how you write them; they'll come out the same. What happens if we do something a little more expensive inside the loop? Let's take a look at something that looks like it would need an "expensive" multiply each loop iteration: Despite our "times tabl...
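A hedged sketch of a times-table-style loop (the exact code is my own): the per-iteration multiply by the index is a textbook strength-reduction target - the compiler replaces it with an induction variable stepped by 3.

```cpp
#include <cstddef>

// Looks like a multiply per iteration, but i * 3 increases by a fixed
// step, so the compiler strength-reduces it to a running addition.
void times_table(int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<int>(i) * 3;
}
```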
Going loopy
Written by me, proof-read by an LLM. Details at end. Which loop style is "best"? This very question led to the creation of Compiler Explorer! In 2011 I was arguing with my team about whether we could switch all our loops from ordinal or iterator-style to the "new" range-for. I wrote a small shell script to iteratively show the compiler output as I edited code in vim, and the seed of Compiler Explorer was born. C and C++ have many ways to phrase loops: ordinal indexing, iterators, range-for, STL algorithms like ...
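The three styles that sparked the argument can be sketched like this; the optimiser canonicalises all of them, so they typically compile to identical code:

```cpp
#include <vector>
#include <cstddef>

// Ordinal: index against size().
int sum_ordinal(const std::vector<int>& v) {
    int total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) total += v[i];
    return total;
}

// Iterator-style: begin()/end().
int sum_iterator(const std::vector<int>& v) {
    int total = 0;
    for (auto it = v.begin(); it != v.end(); ++it) total += *it;
    return total;
}

// Range-for: the "new" style under debate in 2011.
int sum_range(const std::vector<int>& v) {
    int total = 0;
    for (int x : v) total += x;
    return total;
}
```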
Multiplying our way out of division
Written by me, proof-read by an LLM. Details at end. I occasionally give presentations to undergraduates, and one of my favourites is taking the students on a journey of optimising a "binary to decimal" routine. There are a number of tricks, which I won't go into here, but the opening question I have is "how do you even turn a number into its ASCII representation?" If you've never stopped to think about it, take a moment now to do so; it can be a fun problem. The simple approach is to use division and remainder by ten to ...
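A minimal version of the digit-by-digit approach (my own formulation): peel off the last digit with `% 10`, then divide by ten and repeat. The title's punchline is that compilers replace that division with a multiply by a "magic" reciprocal constant.

```cpp
#include <string>

// Extracts decimal digits least-significant first; the compiler turns
// the / 10 and % 10 into multiplies by a precomputed reciprocal.
std::string to_decimal(unsigned value) {
    if (value == 0) return "0";
    std::string out;
    while (value != 0) {
        out.insert(out.begin(), static_cast<char>('0' + value % 10));
        value /= 10;
    }
    return out;
}
```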