Clever memory tricks
Written by me, proof-read by an LLM.
Details at end.
After exploring SIMD vectorisation over the last couple of days, let's shift gears to look at another class of compiler cleverness: memory access patterns. String comparisons seem straightforward enough - check the length, compare the bytes, done. But watch what Clang does when comparing against compile-time constants, and you'll see some rather clever tricks involving overlapping memory reads and bitwise operations.[1] What looks like it should be a call to `memcmp` becomes a handful of inline instructions that exploit the fact that the comparison value is known at compile time.
I've set up nine functions that each compare a `std::string_view` against a constant string of increasing length, from one to nine characters. This gives us a chance to see how the compiler's approach changes based on the length of the comparison.
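For reference, here is a minimal sketch of what those nine functions might look like. Only the names `t1`, `t2`, `t4`, `t7` and `t8` and the `sv == "ABCDEFG"sv` form of the comparison appear in this post; the remaining names and the exact signatures are assumptions that follow the same pattern.

```cpp
#include <string_view>

using namespace std::string_view_literals;

// Assumed reconstruction: one function per constant length, one to nine
// characters. Each compares its argument with a compile-time constant.
bool t1(std::string_view sv) { return sv == "A"sv; }
bool t2(std::string_view sv) { return sv == "AB"sv; }
bool t3(std::string_view sv) { return sv == "ABC"sv; }
bool t4(std::string_view sv) { return sv == "ABCD"sv; }
bool t5(std::string_view sv) { return sv == "ABCDE"sv; }
bool t6(std::string_view sv) { return sv == "ABCDEF"sv; }
bool t7(std::string_view sv) { return sv == "ABCDEFG"sv; }
bool t8(std::string_view sv) { return sv == "ABCDEFGH"sv; }
bool t9(std::string_view sv) { return sv == "ABCDEFGHI"sv; }
```

Compiling something like this with optimisations enabled should let you compare the generated code for each length side by side.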
As we learned when looking at calling conventions, a `std::string_view` is a pointer and a length, passed in two registers on x86 Linux. Each of these functions receives a `std::size_t` length in `rdi` and a `const char *` pointer in `rsi`.[2] One might reasonably expect a call to `memcmp`, but the compiler has both inlined and specialised the comparison for each constant string. Let's take a look at some of these comparison functions, starting with `t1`:
We see the length is checked first, and if it's not 1, then we return. Otherwise, we check the one character to see if it's `A` or not, and then set the return value accordingly. The compiler has used a conditional set instruction, `sete`, to avoid a second branch.
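Written out by hand, the logic of `t1` amounts to something like the sketch below. The name `t1_by_hand` and the split into separate length and pointer parameters are mine, chosen to mirror the two registers; this is an illustration of what the assembly does, not the compiler's internal representation.

```cpp
#include <cstddef>

// Hypothetical hand-written equivalent of t1, taking the length and the
// pointer separately, as they arrive in rdi and rsi.
bool t1_by_hand(std::size_t length, const char* data) {
    if (length != 1) {
        return false;        // cmp rdi, 1 / jne: wrong length, return false
    }
    return data[0] == 'A';   // cmp byte ptr [rsi], 65 / sete al
}
```

The final comparison yields a 0 or 1 directly, which is the job `sete` does in the generated code.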
The pattern holds for power-of-two sizes: looking at `t2`, `t4` and `t8`, we see that the compiler does the same length check, and then cleverly realises it can compare a 2-, 4- or 8-byte value directly with a constant of either `AB`, `ABCD` or `ABCDEFGH` (mouse over the constants in the view to see Compiler Explorer interpret them as ASCII).
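To make the "compare as one integer" idea concrete, here is a rough sketch of the four-byte case in C++. The helpers `load32`, `pack_le` and `t4_by_hand` are hypothetical names of mine. The constant 1145258561 is the one annotated as "ABCD" in the `t7` listing; I'm assuming `t4` compares against the same value.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Read four bytes as one 32-bit integer. memcpy is the portable way to
// express a possibly unaligned load; compilers typically turn it into a
// single mov on x86.
inline std::uint32_t load32(const char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Pack four characters into the value a little-endian 32-bit load would see.
constexpr std::uint32_t pack_le(const char (&s)[5]) {
    return  static_cast<std::uint32_t>(static_cast<unsigned char>(s[0]))
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(s[1])) << 8)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(s[2])) << 16)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(s[3])) << 24);
}

// 'A' is 0x41 and sits in the lowest byte: 0x44434241 == 1145258561.
static_assert(pack_le("ABCD") == 1145258561u, "ABCD as a little-endian integer");

// Hypothetical hand-written equivalent of the four-character comparison.
bool t4_by_hand(std::size_t length, const char* data) {
    if (length != 4) {
        return false;
    }
    // Loading the constant the same way keeps this sketch independent of
    // byte order; the compiler folds it into an immediate.
    return load32(data) == load32("ABCD");
}
```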
Things get more interesting with the 7 character case, `t7`:
The check for the length is the same as the other cases, but once we know we're going to be comparing 7 bytes, some cunning tricks come into play. First, the compiler isn't directly comparing, as you might expect: it uses the fact that XORing identical values will result in a zero. Secondly, it has used two overlapping reads - reading bytes 0,1,2,3 and then 3,4,5,6. The redundant read of byte 3 doesn't matter, and doing two 32-bit reads is cheaper than having to read individual bytes.
Once the two XORs have happened, we have "zero only if first four bytes match ABCD" in `eax` and "zero only if bytes 3,4,5,6 match DEFG" in `ecx`. Simply ORing the two together gives us zero if and only if both were zero - only if all bytes matched. Then a simple `sete` turns the "zero flag" into either 0 or 1 for the `true`/`false` return value needed. Cute!
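The same trick can be sketched by hand in C++. The names `load32` and `t7_by_hand` are hypothetical; this is my reading of the assembly rather than anything the compiler exposes, and in practice you'd just write the comparison and let the compiler do this for you.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Read four bytes as one 32-bit integer (unaligned-safe via memcpy).
inline std::uint32_t load32(const char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Hypothetical hand-written equivalent of the seven-character comparison.
bool t7_by_hand(std::size_t length, const char* data) {
    if (length != 7) {
        return false;
    }
    // Zero if and only if bytes 0,1,2,3 are "ABCD".
    const std::uint32_t first = load32(data) ^ load32("ABCD");
    // Zero if and only if bytes 3,4,5,6 are "DEFG"; byte 3 is read twice.
    const std::uint32_t second = load32(data + 3) ^ load32("DEFG");
    // The OR is zero only when both halves matched.
    return (first | second) == 0;
}
```

Two four-byte reads cover all seven bytes, so no byte-at-a-time tail handling is needed.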
This optimisation works well on x86, where reading unaligned 32-bit values is effectively free. You can play around with the compiler choice and see what neat tricks are conjured up by different compilers and architecture choices.
And that's what makes modern compilers remarkable - all this cleverness is conjured up from a simple `sv == "ABCDEFG"sv`. The overlapping reads, the XOR operations, the branchless conditionals - they're all applied automatically. Your job is to write clear code; the compiler's job is to make it fast. Leave it to do its thing, and try not to get in its way!
See the video that accompanies this post.
This post is day 22 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.
This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.
Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.
Clang's output for `t1`:

```asm
t1:
  cmp rdi, 1                  ; is length 1?
  jne .LBB0_1                 ; if not 1, goto "return false"
  cmp byte ptr [rsi], 65      ; is the byte 65 ('A')?
  sete al                     ; set result to 0 or 1 accordingly
  ret                         ; return
.LBB0_1:
  xor eax, eax                ; set result to false
  ret                         ; return
```

Clang's output for `t7`:

```asm
t7:
  cmp rdi, 7                  ; is length 7?
  jne .LBB6_1                 ; if not, goto "return false"
  mov eax, 1145258561         ; set eax to "ABCD"
  xor eax, dword ptr [rsi]    ; eax ^= first four chars of sv
  mov ecx, 1195787588         ; set ecx to "DEFG"
  xor ecx, dword ptr [rsi+3]  ; ecx ^= chars 3,4,5,6 of sv
  or ecx, eax                 ; ecx |= eax
  sete al                     ; result = 1 if "zero flag" else 0
  ret                         ; return
```

[1] GCC generates more obvious, but slightly worse code, with some unnecessary logic operations. I filed a bug to investigate.
[2] libstdc++'s `std::string_view` is defined as length then pointer, which is why we see length in `rdi` before pointer in `rsi`.