Inlining - the ultimate optimisation

xania.org, 17 December 2025, 12:00

Written by me, proof-read by an LLM. Details at end.

Sixteen days in, and I've been dancing around what many consider the fundamental compiler optimisation: inlining. Not because it's complicated - quite the opposite! - but because inlining is less interesting for what it does (copy-paste code), and more interesting for what it enables.

Initially inlining was all about avoiding the expense of the call itself, but nowadays inlining enables many other optimisations to shine.[1]

We've already encountered inlining (though I tried to limit it until now): On day 8, to get the size of a vector, we called its .size() method. I completely glossed over the fact that while size() is a method on std::vector, we don't see a call in the assembly code, just the subtraction and shift.

So, how does inlining enable other optimisations? Using ARMv7,[2] let's convert a string to uppercase. We might have a utility function change_case that either turns a single character from upper to lower, or lower to upper,[3] so we'll use it in our code:
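The original post embeds this code in a Compiler Explorer view rather than inline. As a stand-in, here's a plausible sketch: the change_case and make_upper names come from the post itself, but the exact bodies are my reconstruction, not the author's.

```cpp
#include <cstddef>

// Hypothetical reconstruction: flip a single character's case.
// With upper == true, lowercase letters become uppercase; with
// upper == false, uppercase letters become lowercase.
static char change_case(char c, bool upper) {
    if (upper) {
        if (c >= 'a' && c <= 'z') return c - 32;  // 'a' - 'A' == 32 in ASCII
    } else {
        if (c >= 'A' && c <= 'Z') return c + 32;
    }
    return c;
}

// Convert a whole string to uppercase via the helper. Note that the
// `upper` argument is always true here: that's the fact the inliner
// can exploit.
void make_upper(char *string, std::size_t len) {
    while (len-- > 0) {
        *string = change_case(*string, true);
        ++string;
    }
}
```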

The compiler decides[5] to inline change_case into make_upper, and then, seeing that upper is always true,[4] it can simplify the whole code to the tight loop shown in the assembly listing near the end of this post.

There's no trace left of the !upper case and the compiler, having inlined the code, has a fresh copy of the code to then further modify to take advantage of things it knows are true. It does a neat trick of avoiding a branch to check whether the character is uppercase: if (c - 'a') & 0xff is less than 26, it must be a lowercase character. It then conditionally subtracts 32, which has the effect of making a into A.[6]
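Conceptually (this is my illustration, not literal compiler output), the result of inlining plus constant propagation is as if we'd written the branchless check by hand:

```cpp
#include <cstddef>

// A hand-written equivalent of what the compiler produces after inlining:
// the !upper path is gone, and the 'a'..'z' range check is a single
// unsigned comparison. Casting (c - 'a') to unsigned char is the & 0xff
// that shows up as the uxtb instruction in the ARM listing.
void make_upper_after_inlining(char *string, std::size_t len) {
    while (len-- > 0) {
        char c = *string;
        if (static_cast<unsigned char>(c - 'a') < 26) {
            c -= 32;  // conditionally executed: the sublo instruction
        }
        *string++ = c;
    }
}
```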

Inlining gives the compiler the ability to make local changes: the implementation can be special-cased at the inline site as, by definition, there are no other callers to that code. The special-casing can include propagating values known to be constants (like the upper bool above), and looking for code paths that are unused.[7]

Inlining has some drawbacks though: if it's overused, the code size of your program can grow quite substantially.[8] The compiler has to make its best guess as to whether inlining a function (and the functions that it calls... and so on) is worthwhile, based on heuristics about the code size increase and whether the perceived benefit is worth it. Ultimately it's a guess, though.
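When the heuristics guess wrong, most compilers let you override them per function. For example, GCC and Clang provide always_inline and noinline attributes (MSVC spells the former __forceinline); a minimal sketch with hypothetical helper functions:

```cpp
// Force inlining regardless of the heuristics (GCC/Clang extension).
static inline __attribute__((always_inline)) int square(int x) {
    return x * x;
}

// Forbid inlining, e.g. to keep a rarely-taken routine from bloating
// hot code at every call site.
__attribute__((noinline)) static int cube(int x) {
    return x * x * x;
}
```

The plain `inline` keyword, by contrast, is only a weak hint (its main job in C++ is to permit multiple definitions across translation units); the attributes above actually override the inliner's decision.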

In rare cases accepting the cost of calling a common routine can be a benefit: if there is an unavoidable branch in the routine that's globally predictable, sometimes having one shared branch site can be better for the branch predictor. In many cases, though, the reverse is true: if there's a branch in code that's inlined many times across the codebase, then sometimes the (more local) branch history for the many copies of that branch can yield more predictability. It's... complex.[9]

An important consideration for inlining is the visibility of the definition of the function you're calling (that is, the body of the function). If the compiler has only seen the declaration of a function (e.g. in the case above just char change_case(char c, bool upper);), then it can't inline it: there's nothing to inline![10] In modern C++ with templates and a lot of code in headers, this usually isn't a problem, but if you're trying to minimise build times and interdependency this can be an issue.
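As a concrete sketch (a hypothetical two-file project, with change_case defined in its own .cpp and only declared in a header), the build commands below show where link-time optimisation changes the picture; the flags are standard g++ options:

```shell
# Separate translation units: when compiling main.cpp the compiler sees
# only the declaration of change_case, so it cannot inline the call.
g++ -O2 -c change_case.cpp main.cpp
g++ -O2 change_case.o main.o -o demo

# With -flto on both compile and link steps, the function bodies survive
# to link time and inlining can happen across translation units.
g++ -O2 -flto -c change_case.cpp main.cpp
g++ -O2 -flto change_case.o main.o -o demo_lto
```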

Inlining is also one of the most heuristic-driven optimisations, with different compilers making reasonable but different guesses about which functions should be inlined. This can be frustrating when adding a single line to a function somewhere has ripple effects throughout a codebase, affecting inlining decisions.[11]

All that said: Inlining is the ultimate enabling optimisation. On its own, copying function bodies into call sites might save a few cycles here and there. But give the compiler a fresh copy of code at the call site, and suddenly it can propagate constants, eliminate dead branches, and apply transformations that would be impossible with a shared function body. Who said copy-paste was always bad?

See the video that accompanies this post.

This post is day 17 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.

← Calling all arguments | Partial inlining →

This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.

Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.

.LBB0_1:
    ldrb  r2, [r0]       ; read next `c`           ; c = *string;
    sub   r3, r2, #97    ; tmp = c - 'a'
    uxtb  r3, r3         ; tmp = tmp & 0xff
    cmp   r3, #26        ; check tmp against 26
    sublo r2, r2, #32    ; if lower than 26 then c = c - 32
                         ; c = ((c - 'a') & 0xff) < 26 ? c - 32 : c;
    strb  r2, [r0], #1   ; store `c` back          ; *string++ = c
    subs  r1, r1, #1     ; reduce counter
    bne   .LBB0_1        ; loop if not zero

  1. The call and return itself, coupled with the preserving and restoring of registers required for the calling convention.

  2. Again, this is for simplicity and to avoid vectorisation which, while super important, is something we'll get to later. 

  3. A bit contrived, and I'm deliberately rolling my own std::toupper etc. to avoid locales, and further function calls.

  4. This is called constant propagation - when the compiler knows a value is constant, it substitutes it everywhere and simplifies the result. I had planned to do a post on this alone, but somehow I didn't have space for it! 

  5. I deliberately made the change_case function static so it won't even appear in the output here as it's otherwise unused. This also strongly hints to the compiler's optimiser to inline it. If I made it non-static, then make_upper doesn't change at all (it's still inlined), but there's a big (unused) change_case in the output to confuse things. Give it a go and look at the more complex code generated in change_case!

  6. The & 0xff comes from the uxtb ("Unsigned eXTend Byte") instruction.

  7. Also known as dead code elimination: if the compiler can prove that parts of the code are unreachable, it can remove them.

  8. This can have performance effects: the instruction cache is a limited resource so filling it up with lots of copies of essentially the same code can put extra pressure on the memory and decode subsystem of the CPU. 

  9. That is, it's a nuanced trade-off that depends on the specific code and runtime patterns. The compiler often doesn't have that kind of information about your code, and so has to guess. Things like Profile Guided Optimisation can help this, but we won't be covering that in this series. 

  10. Using link-time optimisation (LTO; sometimes called link-time code generation) will allow the compiler to inline across translation units. LTO is a powerful technique and is well supported by modern compilers. I always enable LTO for my release builds.

  11. Quite famously, a single newline in a function definition once caused a significant Linux performance regression, though this is more a limitation with the way GCC chooses to interpret inline assembly's cost during inline analysis.