
Written by me, proof-read by an LLM. Details at end.

We've learned how important inlining is to optimisation, but also that it might sometimes cause code bloat. Inlining doesn't have to be all-or-nothing!

Let's look at a simple function that has a fast path and a slow path, and then see how the compiler handles it.¹

In this example we have some function `process` that has a really trivial fast case for numbers in the range 0-99. For other numbers it does something more expensive. Then `compute` calls `process` twice (making it less appealing to inline all of `process`).
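As a rough sketch of the kind of source being described, assuming only what the text and assembly state (values below 100 are doubled, and `compute` adds two `process` results): the slow path here is an invented placeholder, not the code used in the post.

```cpp
// Hypothetical reconstruction: only the fast path (double anything below 100)
// comes from the post. The slow path is a stand-in for "something expensive".
unsigned process(unsigned value) {
    if (value < 100) {
        return value * 2;  // trivial fast path
    }
    // Placeholder slow path: sum of squares of 0..value-1.
    // Unsigned arithmetic wraps on overflow, which is well defined.
    unsigned total = 0;
    for (unsigned i = 0; i < value; ++i) {
        total += i * i;
    }
    return total;
}

// Two call sites: inlining all of process here would duplicate the slow path.
unsigned compute(unsigned a, unsigned b) {
    return process(a) + process(b);
}
```

With this placeholder, `compute(1, 2)` takes the fast path twice and returns 6, while any argument of 100 or more goes through the loop.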

Looking at the assembly output (the listing for `process` appears towards the end of this post), we see what's happened: The compiler has split `process` into two functions, a `process (.part.0)` that does the expensive part only. It then rewrites `process` into the quick check for 100, returning double the value if less than 100. If not, it jumps to the `(.part.0)` function.

This first step - extracting the cold path into a separate function - is called function outlining. The original `process` becomes a thin wrapper handling the hot path, delegating to the outlined `process (.part.0)` when needed. This split sets up the real trick: partial inlining. When the compiler later inlines `process` into `compute`, it inlines just the wrapper whilst keeping calls to the outlined cold path. External callers can still call `process` and have it work correctly for all values.
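Written out by hand, the outlining step might look something like the sketch below. The name `process_part0` is hypothetical, standing in for the compiler-generated `process (.part.0)`, and the slow-path body is an invented placeholder rather than the post's real code.

```cpp
// Hypothetical stand-in for the compiler-generated process (.part.0):
// it contains only the cold, expensive path.
unsigned process_part0(unsigned value) {
    unsigned total = 0;  // placeholder "expensive" work
    for (unsigned i = 0; i < value; ++i) {
        total += i * i;
    }
    return total;
}

// What is left of process: a thin wrapper around the hot path.
unsigned process(unsigned value) {
    if (value < 100) {
        return value * 2;  // hot path stays in the wrapper
    }
    // Delegates to the outlined cold path; a compiler will typically emit
    // this as a tail call, like the jmp in the assembly.
    return process_part0(value);
}
```

The wrapper is now small enough that copying it into each call site costs very little, which is what makes the later inlining decision cheap.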

Let's see this optimisation in action in the `compute` function (its assembly listing appears towards the end of this post).

Looking at `compute`, we can see the benefits of this approach clearly: The simple range check and arithmetic (`cmp`, `lea`) are inlined directly, avoiding the function call overhead for the fast path. When a value is 100 or greater, it calls the outlined `process (.part.0)` function for the more expensive computation.
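In source terms, the resulting `compute` behaves roughly as if the wrapper had been pasted into each call site by hand. This is only an illustrative sketch: `process_part0` is a hypothetical name for the outlined cold path, and its body is a placeholder.

```cpp
// Hypothetical stand-in for the outlined cold path, process (.part.0).
unsigned process_part0(unsigned value) {
    unsigned total = 0;  // placeholder "expensive" work
    for (unsigned i = 0; i < value; ++i) {
        total += i * i;
    }
    return total;
}

// compute after partial inlining: the cheap range check and doubling are
// duplicated at both call sites, while the slow path stays a shared call.
unsigned compute(unsigned a, unsigned b) {
    unsigned ra = (a < 100) ? a * 2 : process_part0(a);
    unsigned rb = (b < 100) ? b * 2 : process_part0(b);
    return ra + rb;
}
```

Only the comparison and the doubling appear twice. The loop exists once, however many call sites end up using it.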

This is the best of both worlds: we get the performance benefit of inlining the lightweight check and simple arithmetic, whilst avoiding code bloat from duplicating the expensive computation. The original `process` function remains intact and callable, so external callers still work correctly.²

Partial inlining lets the compiler make nuanced trade-offs about what to inline and what to keep shared. The compiler can outline portions of a function based on its heuristics about code size and performance, giving you benefits of inlining without necessarily paying the full code size cost. In this example, the simple check is duplicated whilst the complex computation stays shared.³

As with many optimisations, the compiler's heuristics usually make reasonable choices about when to apply partial inlining, but it's worth checking your hot code paths to see if the compiler has made the decisions you expect. Taking a quick peek in Compiler Explorer is a good way to develop your intuition.⁴

See the video that accompanies this post.

This post is day 18 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.


This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.

Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.

process(unsigned int):
        cmp     edi, 99                         ; less than or equal to 99?
        jbe     .L7                             ; skip to fast path if so
        jmp     process(unsigned int) (.part.0) ; else jump to the expensive path
.L7:
        lea     eax, [rdi+rdi]                  ; return `value * 2`
        ret

compute(unsigned int, unsigned int):
        cmp     edi, 99                         ; is a <= 99?
        jbe     .L13                            ; if so, go to the inlined fast path for a
        call    process(unsigned int) (.part.0) ; else, call expensive case
        mov     r8d, eax                        ; save the result of process(a)
        cmp     esi, 99                         ; is b <= 99?
        jbe     .L14                            ; if so go to the inlined fast path for b
.L11:
        mov     edi, esi                        ; otherwise, call expensive case for b
        call    process(unsigned int) (.part.0)
        add     eax, r8d                        ; add the two slow cases together
        ret                                     ; return
.L13:                                           ; case where a is fast case
        lea     r8d, [rdi+rdi]                  ; process(a) is just a + a
        cmp     esi, 99                         ; is b > 99?
        ja      .L11                            ; jump to b slow case if so
                                                ; (falls through to...)
.L14:                                           ; b fast case
        lea     eax, [rsi+rsi]                  ; double b
        add     eax, r8d                        ; return 2*a + 2*b
        ret

¹ I have had to cheat a little here to get the output I want: I've actually disabled GCC's main inlining pass, otherwise it chooses to inline the whole of `process`. With a larger, more complex "slow path" that would be unnecessary, but in order to demonstrate the effect of partial inlining without generating tons of code, I'm using this slight cheat.
² Again, in this contrived example it probably would be OK to inline `process`, and the compiler really wants to, but for didactic purposes I've prevented that here. You can hopefully get the gist of this.
³ Of course, nothing is free - duplicating code still takes up instruction cache space. The compiler's heuristics have to weigh the benefits against the costs, and different compilers make different choices.
⁴ Note that this varies substantially from compiler to compiler: I couldn't trick clang into making similar partial inlining decisions to gcc using flags, so I couldn't compare like with like. In my experience gcc and clang make quite different choices about inlining.