Unswitching loops for fun and profit

xania.org, 12 December 2025, 12:00

Written by me, proof-read by an LLM. Details at end.

Sometimes the compiler decides the best way to optimise your loop is to... write it twice. Sounds counterintuitive? Let's change our `sum` example from before¹ to optionally return a sum-of-squares:
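A minimal sketch of such a function follows. The `std::vector<int>` parameter, the accumulator name and the exact form of the ternary are assumptions, so this is not necessarily the listing from the earlier post.

```cpp
#include <vector>

// Sketch with an assumed signature: returns the sum of `values`, or the sum
// of their squares when `squared` is true.
int sum(const std::vector<int> &values, bool squared) {
    int total = 0;
    for (int value : values) {
        // `squared` never changes inside the loop: it is loop-invariant.
        total += squared ? value * value : value;
    }
    return total;
}
```

For example, `sum({1, 2, 3}, false)` gives 6 and `sum({1, 2, 3}, true)` gives 1 + 4 + 9 = 14.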

At `-O2` the compiler turns the ternary into `sum += value * (squared ? value : 1);`, using a multiply-and-add (`mla`) instruction and conditionally picking either `value` or the constant `1` as the multiplier, which avoids a branch inside the loop.
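In C++ terms, the `-O2` strategy behaves roughly like the sketch below. The function name `sum_branchless` and the `std::vector<int>` parameter are hypothetical; the compiler makes this rewrite itself and emits ARM assembly, not this source.

```cpp
#include <vector>

// Hypothetical illustration of the single-loop -O2 strategy: there is no
// branch on `squared` in the loop body. The ternary only chooses the
// multiplier (value or 1), which a compiler can implement with a
// conditional select rather than a jump.
int sum_branchless(const std::vector<int> &values, bool squared) {
    int sum = 0;
    for (int value : values) {
        // A multiply-and-add happens on every iteration, even when the
        // multiplier is just 1.
        sum += value * (squared ? value : 1);
    }
    return sum;
}
```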

However, if we turn the optimisation level up, the compiler uses a new approach:

Here the compiler realises the `bool squared` value doesn't change throughout the loop, and decides to duplicate the loop: one copy that squares each time unconditionally, and one where it never multiplies at all. This is called "loop unswitching".

The check of `squared` is moved out of the loop, and the appropriate loop is then selected: either `.LBB0_4` (non-squaring) or continuing to `.LBB0_2` (the squaring version).

Each loop is perfectly optimised for its duties, and it's as if you had written:
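A minimal sketch of that hand-unswitched version, assuming a `std::vector<int>` parameter (the name `sum_unswitched` is hypothetical):

```cpp
#include <vector>

// Hand-written loop unswitching: the loop-invariant test of `squared` is
// hoisted out of the loop, and each outcome gets its own specialised loop.
int sum_unswitched(const std::vector<int> &values, bool squared) {
    int sum = 0;
    if (squared) {
        // Squaring version: always multiply.
        for (int value : values) {
            sum += value * value;
        }
    } else {
        // Plain version: never multiply.
        for (int value : values) {
            sum += value;
        }
    }
    return sum;
}
```

The test of `squared` now runs once per call instead of once per element, in exchange for a second copy of the loop.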

By duplicating the loop this way, the compiler makes sure that the multiplication doesn't happen unless you specifically asked for it. In the `-O2` code, the compiler bet on doing the multiply-and-add each time even when it wasn't needed.² In the loop unswitching case, we do pay a code size penalty (some code is duplicated, after all), and that's why loop unswitching didn't occur on the lower optimisation setting.

As always, it's good to trust your compiler's decisions, but know the kinds of trade-offs it's making at various optimisation levels. You can always verify what it's doing with Compiler Explorer, after all!

See the video that accompanies this post.

This post is day 12 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.


This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.

Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.



  1. We're using ARMv7 again here for simplicity, and to avoid, yet again, the vectorisation we'll get to later in this series.

  2. In this case the compiler determined that just doing the multiply every time would be cheaper than a branch on every iteration to avoid it. Branch predictors do help here in general, but it's still better to do one branch at the beginning and specialise each loop individually.