When SIMD Fails: Floating Point Associativity

xania.org · 21 December 2025, 12:00

Written by me, proof-read by an LLM. Details at end.

Yesterday we saw SIMD work beautifully with integers. But floating point has a surprise in store. Let's try summing an array¹:

Looking at the core loop, the compiler has pulled off a clever trick:

The compiler is using a vectorised add instruction which treats ymm0 as 8 separate integers, adding to each of them individually the corresponding elements read from [rdi]. That's incredibly efficient, processing 8 integers per loop iteration, but at the end of that loop we'll have 8 separate subtotals. The compiler adds a bit of "fix-up" code after the loop to sum these 8 subtotals:

This sequence repeatedly adds the "top half" of the result to the bottom half until there's only one left. This fix-up code is a little bit of extra work, but the efficiency of the loop makes up for it.²
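The halving trick can be sketched in scalar C++ (my own illustration, not code from the post): eight lane subtotals are folded in half repeatedly, mirroring the vextracti128/vpsrldq + vpaddd sequence.

```cpp
#include <array>
#include <cstdint>

// Scalar sketch of the compiler's fix-up sequence: eight subtotals (one per
// SIMD lane) are repeatedly folded in half until a single total remains,
// just as the vextracti128/vpsrldq shifts line lanes up for vpaddd.
int32_t fold_subtotals(std::array<int32_t, 8> lanes) {
    for (int width = 4; width >= 1; width /= 2) {  // 8 -> 4 -> 2 -> 1 lanes
        for (int i = 0; i < width; ++i) {
            lanes[i] += lanes[i + width];  // add the "top half" onto the bottom half
        }
    }
    return lanes[0];
}
```

With lanes {1, 2, 3, 4, 5, 6, 7, 8} this returns 36, the same as a straight left-to-right sum, because integer addition is associative.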

Let's switch to floats³ and see what happens:

We're still processing 32 bytes' worth of floats per loop iteration, but something unfortunate is going on⁴:

The compiler has chosen to do 8 floats per loop iteration... but then proceeded to do each add individually. What has happened here?

To understand, we need to recall that maths on floats is special: operations are not associative. When adding integers, you can regroup the operations however you like: (x + y) + z gives the same result as x + (y + z). Not so with floats: after each operation, rounding occurs. Depending on the relative magnitudes of x, y and z, regrouping the additions will change the result.
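A concrete demonstration (my example, not code from the post): 1e8f is exactly representable as a float, but a float's 24-bit mantissa cannot hold 100000001, so adding 1.0f to a value of that magnitude rounds the 1 away entirely.

```cpp
// Float addition is not associative: the same three values, regrouped,
// give different answers because rounding happens after each operation.
inline float left_grouped() {
    return (1e8f + -1e8f) + 1.0f;  // cancellation happens first: 0.0f + 1.0f = 1.0f
}
inline float right_grouped() {
    return 1e8f + (-1e8f + 1.0f); // -1e8f + 1.0f rounds back to -1e8f, so the result is 0.0f
}
```

Run it and `left_grouped()` returns 1.0f while `right_grouped()` returns 0.0f, despite being "the same" sum.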

When our compiler rewrote our integer sum loop into eight subtotals to take advantage of SIMD, and then summed the subtotals, it was regrouping the additions. We wrote a loop that summed from the first element to the last, and the compiler changed it to "accumulate every 8th element in 8 separate groups, then add those together". That works for integers but not floating point numbers.
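The compiler's regrouping can be written out by hand. A scalar sketch (my own, not from the post) of what the vectorised integer loop effectively computes:

```cpp
#include <cstddef>

// Every 8th element goes into one of 8 independent subtotals, which are
// summed at the end. For ints this equals a left-to-right sum; for floats
// the regrouping could round differently, which is why the compiler
// refuses to do it by default.
int sum_in_lanes(const int* data, std::size_t n) {  // n assumed to be a multiple of 8
    int lane[8] = {0};
    for (std::size_t i = 0; i < n; i += 8)
        for (int j = 0; j < 8; ++j)
            lane[j] += data[i + j];  // one vpaddd's worth of work
    int total = 0;
    for (int j = 0; j < 8; ++j)
        total += lane[j];  // the fix-up: sum the subtotals
    return total;
}
```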

So: are we stuck? Perhaps we know we don't care about associativity? Some of you may be itching to use the -Ofast or -funsafe-math-optimizations⁵ flag here. Both flags give the compiler leeway to globally ignore the rules about floating-point maths, and both will work.⁶ However, there's a better, more targeted way:

In this example, we tell GCC to "assume maths is associative" for just the sum function by using an annotation. That means the effects are limited to just that one function, and will be applied consistently regardless of how the function is compiled or used. Yes, it does rely on a GCC extension, which is unfortunate. In theory we can use std::reduce with an unsequenced execution policy, but in my testing that didn't work here.
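One plausible spelling of such an annotation (the exact option string in the post may differ, e.g. "-fassociative-math") uses GCC's optimize function attribute:

```cpp
#include <cstddef>

// Hypothetical example: GCC's optimize attribute scopes fast-math-style
// associativity to just this one function, leaving the rest of the
// translation unit compiled under strict floating-point rules.
__attribute__((optimize("-ffast-math")))
float sum(const float* data, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];  // GCC is now free to keep 8 partial sums in a ymm register
    return total;
}
```

Note this attribute is a GCC extension; other compilers will ignore or reject it.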

Floating point maths has some unusual and perhaps surprising gotchas. Knowing about them can help you work with the compiler to get super fast vectorised code.

See the video that accompanies this post.

This post is day 21 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.

← SIMD City: Auto-vectorisation | Clever memory tricks →

This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.

Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.

The integer sum core loop:

    .L2:
            vpaddd  ymm0, ymm0, YMMWORD PTR [rdi]  ; pick up 8 integers and add them element-wise to ymm0
            add     rdi, 32                        ; move to the next batch of integers
            cmp     rax, rdi                       ; at the end?
            jne     .L2                            ; loop if not
The fix-up code that sums the 8 subtotals:

            vextracti128 xmm1, ymm0, 0x1           ; xmm1 = ymm0 >> 128
            vpaddd  xmm0, xmm1, xmm0               ; xmm0 += xmm1
            vpsrldq xmm1, xmm0, 8                  ; xmm1 = xmm0 >> 64
            vpaddd  xmm0, xmm0, xmm1               ; xmm0 += xmm1
            vpsrldq xmm1, xmm0, 4                  ; xmm1 = xmm0 >> 32
            vpaddd  xmm0, xmm0, xmm1               ; xmm0 += xmm1
            vmovd   eax, xmm0                      ; return xmm0
The float version of the loop:

    .L2:
            vaddss  xmm0, xmm0, DWORD PTR [rdi]    ; xmm0 += first elem
            add     rdi, 32                        ; move to next 8 floats
            vaddss  xmm0, xmm0, DWORD PTR [rdi-28] ; xmm0 += second
            vaddss  xmm0, xmm0, DWORD PTR [rdi-24] ; etc
            vaddss  xmm0, xmm0, DWORD PTR [rdi-20] ; ...
            vaddss  xmm0, xmm0, DWORD PTR [rdi-16]
            vaddss  xmm0, xmm0, DWORD PTR [rdi-12]
            vaddss  xmm0, xmm0, DWORD PTR [rdi-8]
            vaddss  xmm0, xmm0, DWORD PTR [rdi-4]  ; xmm0 += eighth
            cmp     rax, rdi                       ; at the end?
            jne     .L2                            ; loop if not

  1. Yes, I know, this is what std::accumulate and similar are for, and you'll see similar code generation if you use them.

  2. Especially in this situation, where the compiler can see we have 65536 elements to process. If we had a variable number of elements, the compiler might add code to conditionally vectorise based on the count, which it has to do anyway to check whether there are fewer than 8 elements.

  3. If you go back to yesterday's post and switch to floats, you'll see the compiler is able to max with vectors just as well as with integers.

  4. My heuristic to see if the compiler has vectorised is to find the loop and then look at how much the compiler adds to the loop counter: it's a good starting point for telling if we're doing more than one element at a time. In this case though, it's slightly misleading. 

  5. My favourite flag to the compiler, though it is neither fun nor safe. 

  6. You can edit the flags in the example and see it actually work. However, both flags affect the whole compiled translation unit. That can lead to some unfortunate side effects outside of the one or two functions you want this applied to. Chandler Carruth refers to this as "unbounded precision loss" and he's not wrong.