
Written by me, proof-read by an LLM. Details at end.

After exploring SIMD vectorisation over the last couple of days, let's shift gears to look at another class of compiler cleverness: memory access patterns. String comparisons seem straightforward enough - check the length, compare the bytes, done. But watch what Clang does when comparing against compile-time constants, and you'll see some rather clever tricks involving overlapping memory reads and bitwise operations. What looks like it should be a call to memcmp becomes a handful of inline instructions that exploit the fact that the comparison value is known at compile time.¹

I've set up nine functions that each compare a std::string_view against a constant string of increasing length, from one to nine characters. This gives us a chance to see how the compiler's approach changes based on the length of the comparison.
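The source for those functions isn't reproduced in this text, so here is a minimal sketch of what they might look like. The names t1 to t9 follow the post, but the exact bodies, and the nine-character constant in particular, are assumptions based on the sv == "ABCDEFG"sv expression quoted later on.

```cpp
#include <string_view>

using namespace std::string_view_literals;

// Assumed shape of the test functions: each one compares its argument
// against a compile-time constant one character longer than the last.
bool t1(std::string_view sv) { return sv == "A"sv; }
bool t2(std::string_view sv) { return sv == "AB"sv; }
bool t3(std::string_view sv) { return sv == "ABC"sv; }
bool t4(std::string_view sv) { return sv == "ABCD"sv; }
bool t5(std::string_view sv) { return sv == "ABCDE"sv; }
bool t6(std::string_view sv) { return sv == "ABCDEF"sv; }
bool t7(std::string_view sv) { return sv == "ABCDEFG"sv; }
bool t8(std::string_view sv) { return sv == "ABCDEFGH"sv; }
bool t9(std::string_view sv) { return sv == "ABCDEFGHI"sv; }  // assumed constant
```

Because the right-hand side is a literal, both its length and its bytes are visible to the optimiser, which is what makes the specialisation possible.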

As we learned when looking at calling conventions, a std::string_view is a pointer and a length, passed in two registers on x86 Linux. Each of these functions receives a std::size_t length in rdi and a const char * pointer in rsi.² One might reasonably expect a call to memcmp, but the compiler has both inlined and specialised the comparison for each constant string. Let's take a look at some of these comparison functions, starting with t1:

We see the length is checked first, and if it's not 1, then we return false. Otherwise, we check the one character to see if it's 'A' or not, and then set the return value accordingly. The compiler has used a conditional set instruction, sete, to avoid a second branch.

The pattern holds for power-of-two sizes: looking at t2, t4 and t8, we see that the compiler does the same length check, and then cleverly realises it can compare a 2, 4 or 8-byte value directly with a constant of either "AB", "ABCD" or "ABCDEFGH" (mouse over the constants in the view to see Compiler Explorer interpret them as ASCII).
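As a rough hand-written equivalent of the 4-byte case, the sketch below loads the four bytes into a 32-bit integer and compares that with the constant in one go. The name equals_abcd is hypothetical, and std::memcpy is used here as the portable way to express an unaligned load; this illustrates the idea rather than the compiler's actual transformation.

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical hand-written version of the 4-byte comparison.
bool equals_abcd(std::string_view sv) {
    if (sv.size() != 4) return false;  // same length check as the generated code

    // Read all four characters as a single 32-bit value. memcpy avoids
    // alignment and aliasing problems and typically compiles to one load.
    std::uint32_t value;
    std::memcpy(&value, sv.data(), sizeof value);

    // Build the expected value the same way, so the sketch does not
    // depend on the byte order of the machine it runs on.
    std::uint32_t expected;
    std::memcpy(&expected, "ABCD", sizeof expected);

    return value == expected;
}
```

The 2-byte and 8-byte cases would follow the same shape with std::uint16_t and std::uint64_t.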

Things get more interesting with the 7 character case, t7:

The check for the length is the same as the other cases, but once we know we're going to be comparing 7 bytes, some cunning tricks come into play. First, the compiler isn't directly comparing, as you might expect: it uses the fact that XORing identical values will result in a zero. Secondly, it has used two overlapping reads - reading bytes 0,1,2,3 and then 3,4,5,6. The redundant read of byte 3 doesn't matter, but doing two 32-bit reads is cheaper than having to read individual bytes.

Once the two XORs have happened, we have "zero only if first four bytes match ABCD" in eax and "zero only if bytes 3,4,5,6 match DEFG" in ecx. Simply bitwise-ORing the two together gives us zero if and only if both were zero - only if all bytes matched. Then a simple sete turns the "zero flag" into either 0 or 1 for the true/false return value needed. Cute!
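The same trick can be written out by hand. This is a minimal sketch, assuming hypothetical helper names load32 and equals_abcdefg, and using std::memcpy to stand in for the unaligned 32-bit loads; it mirrors the XOR, XOR, OR sequence described above rather than reproducing the compiler's output.

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

// Load four bytes starting at p as one 32-bit value (unaligned-safe).
static std::uint32_t load32(const char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Hypothetical hand-written version of the 7-byte comparison.
bool equals_abcdefg(std::string_view sv) {
    if (sv.size() != 7) return false;
    const char* p = sv.data();

    // XOR gives zero only where the bits are identical.
    std::uint32_t first = load32(p) ^ load32("ABCD");       // bytes 0,1,2,3
    std::uint32_t last = load32(p + 3) ^ load32("DEFG");    // bytes 3,4,5,6

    // OR the two results: zero if and only if both halves matched.
    return (first | last) == 0;
}
```

Byte 3 is checked twice, once in each load, which costs nothing extra and avoids splitting the tail into a 2-byte and a 1-byte read.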

This optimisation works well on x86 as reading unaligned 32-bit values is free. You can play around with the compiler choice and see what neat tricks are conjured up by different compilers and architecture choices.

And that's what makes modern compilers remarkable - all this cleverness is conjured up from a simple sv == "ABCDEFG"sv. The overlapping reads, the XOR operations, the branchless conditionals - they're all applied automatically. Your job is to write clear code; the compiler's job is to make it fast. Leave it to do its thing, and try not to get in its way!

See the video that accompanies this post.

This post is day 22 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.


This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.


t1:
        cmp     rdi, 1                  ; is length 1?
        jne     .LBB0_1                 ; if not 1, goto "return false"
        cmp     byte ptr [rsi], 65      ; is the byte 65 ('A')?
        sete    al                      ; set result to 0 or 1 accordingly
        ret                             ; return
.LBB0_1:
        xor     eax, eax                ; set result to false
        ret                             ; return

t7:
        cmp     rdi, 7                      ; is length 7?
        jne     .LBB6_1                     ; if not, goto "return false"
        mov     eax, 1145258561             ; set eax to "ABCD"
        xor     eax, dword ptr [rsi]        ; eax ^= first four chars of sv
        mov     ecx, 1195787588             ; set ecx to "DEFG"
        xor     ecx, dword ptr [rsi + 3]    ; ecx ^= chars 3,4,5,6 of sv
        or      ecx, eax                    ; ecx |= eax
        sete    al                          ; result = 1 if "zero flag" else 0
        ret                                 ; return
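The two decimal immediates in the t7 listing can be decoded by hand. The sketch below, with the hypothetical name decode_le32, pulls the bytes out lowest first, which is the order a little-endian machine such as x86 stores them in memory.

```cpp
#include <cstdint>
#include <string>

// Interpret a 32-bit immediate as four ASCII characters, lowest byte first,
// matching how a little-endian 32-bit load sees four consecutive chars.
std::string decode_le32(std::uint32_t value) {
    std::string out;
    for (int i = 0; i < 4; ++i) {
        out.push_back(static_cast<char>((value >> (8 * i)) & 0xFF));
    }
    return out;
}
```

For example, 1145258561 is 0x44434241, so decode_le32(1145258561) yields "ABCD", and 1195787588 is 0x47464544, which yields "DEFG".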

¹ GCC generates more obvious, but slightly worse code, with some unnecessary logic operations. I filed a bug to investigate.
² libstdc++'s std::string_view is defined as length then pointer, which is why we see length in rdi before pointer in rsi.

Clever memory tricks