Chasing your tail
Written by me, proof-read by an LLM.
Details at end.
Inlining is fantastic, as we've seen recently. There's a place it surely can't help though: recursion! If we call our own function, then surely we can't inline...
Let's see what the compiler does with the classic recursive "greatest common divisor" routine - surely it can't avoid calling itself?
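The post's exact source isn't preserved in this extract, but the classic formulation looks something like this sketch - note that the recursive call is the very last thing the function does:

```cpp
// A sketch of the classic recursive Euclidean gcd (the post's exact code
// isn't preserved here). The recursive call is in "tail position": it is
// the last thing the function does before returning.
unsigned gcd(unsigned a, unsigned b) {
    if (b == 0) return a;
    return gcd(b, a % b);
}
```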
And yet: the compiler is able to avoid the recursion! When a function ends by calling another function (including itself), the compiler is often able to replace the call with a jump to that function. This is called "tail call optimisation" (TCO). If that function is itself, the "jump to the start" is effectively the same as a loop. Our compiler turns our recursive function into a loop.¹
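Rendered back into C++, the optimised code behaves roughly like the following (my reading of the assembly in footnote 1, not code from the post):

```cpp
// What the compiler effectively produced: the recursion has become a loop.
// A single div instruction computes the quotient and remainder together.
unsigned gcd(unsigned a, unsigned b) {
    while (b != 0) {
        const unsigned remainder = a % b;
        a = b;
        b = remainder;
    }
    return a;
}
```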
Pretty neat, eh? In this particular case the fact that we are recursing also unlocked further optimisations, but in general, tail recursion is useful to save a few instructions and some stack space.²
Sometimes we need to guarantee that tail calls happen. Clang has a way to force tail call optimisation, or error if it's not possible: the [[clang::musttail]] annotation. In our gcd example, without tail-call optimisation we need a lot more stack space. To guarantee that tail calls happen (even in debug mode) we can specify this annotation.³
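For example (a sketch, not the post's code - the attribute goes on the return statement and needs a reasonably recent clang):

```cpp
unsigned gcd(unsigned a, unsigned b) {
    if (b == 0) return a;
    // Compilation fails if clang cannot turn this into a genuine tail call:
    [[clang::musttail]] return gcd(b, a % b);
}
```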
Tail call optimisation saves stack space, enables loop optimisations, and can even make your branch predictor happier. All this from a compiler noticing your function ends with a call and thinking "why bother coming back?" Sometimes the best optimisation is to just keep forging ahead!
One last point worth mentioning, tying back to calling conventions: if you arrange for your tail-called function's arguments to be in the same order (or a prefix of that order) as the calling function's parameters, the compiler can be super efficient. Compare these two functions:
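The original pair isn't preserved in this extract; here's a sketch consistent with the names referenced below (wrapped_func and z are from the post, the rest is my guess):

```cpp
int wrapped_func(int x, int y);  // the function we tail-call

// x and y are already in the right registers (edi/esi on x86-64 System V),
// so the tail call can be a bare jmp; z is simply ignored.
int same_order(int x, int y, int z) {
    return wrapped_func(x, y);
}

// The arguments are swapped, so the compiler must shuffle registers
// around before it can jump.
int switched_args(int x, int y, int z) {
    return wrapped_func(y, x);
}
```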
Our same order function simply jumps on to wrapped_func even though we take z too - the compiler can ignore it without issue. For the switched-args version, the compiler has to emit instructions to shuffle the parameters around, which is obviously more work. Sometimes that's unavoidable, of course, but it's worth keeping in mind when designing your function interfaces.
Note: This section explores a more advanced application of TCO. If you're feeling overwhelmed, feel free to skip it - the core TCO concepts are all covered above.
One surprising application of tail call optimisation is in CPU emulators. Most emulators use a switch statement inside a loop - fetch an opcode byte, switch on it to handle the instruction, repeat. That's simple and works well.⁴
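Something like this minimal sketch (the structure and names here are mine, not the post's):

```cpp
#include <cstdint>

struct Cpu {
    uint16_t pc = 0;
    bool halted = false;
    uint8_t mem[65536] = {};
};

// The classic fetch/decode/execute loop: one switch handles every opcode.
void run(Cpu &cpu) {
    while (!cpu.halted) {
        const uint8_t opcode = cpu.mem[cpu.pc++];  // fetch
        switch (opcode) {                          // decode + execute
            case 0x00: cpu.halted = true; break;   // pretend 0x00 halts
            case 0xea: /* NOP */ break;
            // ...a case for each of the 256 opcodes...
            default: break;
        }
    }
}
```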
But there's an alternative approach using tail calls:⁵
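The post's code isn't preserved in this extract; here's a sketch of the idea, reusing the Cpu type from the switch example above (next_opcode is the post's name, everything else is my assumption):

```cpp
void next_opcode(Cpu &cpu);  // forward declaration of the dispatcher

// Each instruction handler does its work, then tail-calls the dispatcher.
void op_halt(Cpu &cpu) { cpu.halted = true; }
void op_nop(Cpu &cpu) {
    [[clang::musttail]] return next_opcode(cpu);  // a jmp, not a call
}

using Handler = void (*)(Cpu &);
const Handler opcode_table[256] = {
    op_halt, op_nop,  // a real emulator fills in all 256 entries
};

// Fetch the next opcode and tail-call its handler via the table.
void next_opcode(Cpu &cpu) {
    if (cpu.halted) return;
    const uint8_t opcode = cpu.mem[cpu.pc++];
    [[clang::musttail]] return opcode_table[opcode](cpu);  // indirect jmp
}
```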
Each instruction handler calls next_opcode(), which looks up and calls the next handler. This looks like unbounded recursion! But the compiler inlines next_opcode() and uses TCO, turning it into a jump. Each handler ends with an indirect jmp through the opcode table - no stack growth at all.
Why bother with this complexity? Branch prediction! Modern CPUs predict branches based on history and branch path, indexed by each branch location. With a switch statement, there's one branch trying to predict which of all 256 instructions comes next - an impossible task. In my emulators, that single branch accounts for a double-digit percentage of mispredictions.
With the TCO approach, each of the 256 instruction handlers has its own branch, each with its own predictor state. The branch at the end of the 6502's CMP handler can learn "often followed by BNE", while the branch at the end of LDA learns different patterns. This gives the CPU's branch predictor much better information to work with.
It's a niche optimisation, but a neat example of how TCO enables techniques that would otherwise be impractical!
See the video that accompanies this post.
This post is day 19 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.
← Partial inlining | SIMD City: Auto-vectorisation →
This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.
Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.
1. The generated assembly:

```asm
gcd(unsigned int, unsigned int):
        mov     eax, edi   ; eax (return val) = a
        test    esi, esi   ; is b == 0?
        je      .LBB0_4    ; if b is zero, jump to the end
        mov     edx, esi   ; edx = b
.LBB0_2:
        mov     ecx, edx   ; ecx = b
        xor     edx, edx   ; edx = 0 (set up for divide)
        div     ecx        ; eax = edx:eax (0:a) / ecx (b); edx = edx:eax (0:a) % ecx (b)
        mov     eax, ecx   ; eax = b
        test    edx, edx   ; was a % b == 0?
        jne     .LBB0_2    ; if not, keep looping
        mov     eax, ecx   ; eax (return val) = b
.LBB0_4:
        ret
```

Interestingly, gcc generated a number of redundant extra instructions, so I ended up filing a bug! ↩
2. Similar optimisations can take a recursive factorial (return x * fact(x-1);) into iteration too. Compilers are fab. ↩
3. TCO is most efficient when the tail-called function's parameters happen to already be in the right registers. As we saw when looking at calling conventions, which parameters end up in registers (and which registers) is determined by the ABI. ↩
4. This trick applies to bytecode interpreters too: Lua, DWARF debug info, stack unwinding, etc. My emulators: jsbeeb (BBC Micro), Miracle (Sega Master System), specbolt (ZX Spectrum in C++/WASM). ↩
5. This example is simplified to illustrate the concept - a real implementation would handle state more carefully. ↩