Compilers are (too) smart

Last week I noticed some unexpected assembly code in a function calling std::unordered_map::find (yes, I know, friends don’t let friends use STL, but that is a topic for another discussion). I decided to investigate a bit and it turned out to be quite a journey. It was a Clang compiler, ARM-based platform, I will use x64 here as behavior is similar and it is easier to repro in the Compiler Explorer. What is important however, we should be using libcxx STL (-stdlib=libc++ switch). A minimal example might look like:

using hmap = std::unordered_map<size_t,void*,ident>;
bool foo(const hmap& hm, size_t k)
    return hm.find(k) != hm.end();

(ident is our version of std::identity, ie. our hash of ‘x’ is just ‘x’, key was pre-hashed, I don’t think it matter, but removes hashing from the equation) Simple enough, yet if we look at the generated assembly there is this giant blob at the beginning:

mov     rax, r8
shr     rax
movabs  rcx, 6148914691236517205
and     rcx, rax
mov     rax, r8
sub     rax, rcx
movabs  rcx, 3689348814741910323
mov     rdx, rax
and     rdx, rcx
shr     rax, 2
and     rax, rcx
add     rax, rdx
mov     rcx, rax
shr     rcx, 4
add     rcx, rax
movabs  rax, 1085102592571150095
and     rax, rcx
movabs  r9, 72340172838076673
imul    r9, rax
shr     r9, 56
cmp     r9, 1
ja      .LBB0_3

It bothered me because it was almost as many instructions as the rest of the function on the fast path! What is going on here? At first I thought it was not eliminating the hash function properly, but it was quickly disproved. Don’t kick yourself if you can’t tell what that code is doing, it was a bit more obvious in the ARM version as it uses hex constants. What are these weird numbers? Well,

6148914691236517205 = 0x5555555555555555
3689348814741910323 = 0x3333333333333333
1085102592571150095 = 0xf0f0f0f0f0f0f0f
72340172838076673 = 0x101010101010101
Still not clear? It was not for me either, but it was familiar enough and ‘bit-twiddly’ enough to find an answer with some quick web search - it basically counts the number of set bits (1s) in a word. I still didn’t understand where did it come from, it was all inlined in optimized builds and obviously looked very different in debug. I tried to step through the C++ code anyway, looking for clues and found this . find calls constrain_hash (multiple times) – it basically converts a potentially very big hash to 0->bucket count range…

inline _LIBCPP_HIDE_FROM_ABI size_t __constrain_hash(size_t __h, size_t __bc) {
  return !(__bc & (__bc - 1)) ? __h & (__bc - 1) : (__h < __bc ? __h : __h % __bc);

As we can see, it tries very hard to avoid division, so if our bucket count (bc) is a power-of-two, we can get away with an AND. x&(x-1)==0 is simply a power of two test, but another way to do it is… to count the set bits, powers of two will only have one. Why did Clang decide to use? Well, some CPUs have an instruction to do it, in which case it makes a bit more sense. Compare same snippet compiled with -msse4 (you need SSE4 on x64 for popcnt):

popcnt  r9, r8
cmp     r9, 1
ja      .LBB0_3

Now, if you ask me, I’m still not sure it is worth it, yes, it is 1 instruction shorter and needs 1 register less, but popcnt has some latency… not an expert though. What is interesting, though, it is incredibly finicky. For example, when searching for more information on constrain_hash, I found this change . Undoing part of it, ie. going back to:

inline _LIBCPP_HIDE_FROM_ABI size_t __constrain_hash(size_t __h, size_t __bc) {
  return !(__bc & (__bc - 1)) ? __h & (__bc - 1) : (__h % __bc);

… is enough to stop Clang from trying to use the popcnt version. I also tried creating a minimal repro, without including unordered_map. It turns out that constrain_hash in isolation does only the DEC/TEST version, it is only find that calls it (potentially) multiple times that tries to get fancy. It does make a bit more sense here as we only do popcnt once and then reuse the bitcount to quickly select between DIV and AND. However, it is absolutely possible to get similar optimizations with the other version. MSVC does a decent job, but it repeats some of the calculations, but if you can convince Clang to drop popcnt (for example by using ‘old’ constrain_hash) it generates a pretty nice code with basically 2 distinct loops.

If you would like to experiment yourself, Compiler Explorer snippet can be found here (change #if 1 to #if 0 in constrain_hash to observe the effect). CE has a pretty great feature which lets you see both LLVM IR and the Optimization Pipeline - it is really quite fascinating. For example if you observe the optimization process, you will notice it begins with

%sub = sub i64 %1, 1
%and = and i64 %0, %sub
%tobool = icmp ne i64 %and, 0

… then at some point, during the InstCombinePass it changes to

%0 = call range(i64 0, 65) i64 @llvm.ctpop.i64(i64 %__bc)
%tobool.not = icmp ult i64 %0, 2

…X86 DAG->DAG (x86-isel) will however implement it as DEC/TEST:

6:gr64 = DEC64r %5:gr64(tied-def 0), implicit-def dead $eflags; example.cpp:15:17
TEST64rr %5:gr64, killed %6:gr64, implicit-def $eflags; example.cpp:15:17

This applies to constrain_hash in separation, though. find works differently – it inlines constrain_hash still in the ctpop form (InlinerPass):

%0 = call range(i64 0, 65) i64 @llvm.ctpop.i64(i64 %__bc)
%tobool.not.i = icmp ult i64 %0, 2

… in this case however x86-isel stage will generate the popcnt instruction (or a terribly verbose equivalent). I do think it is a bug in this case, native popcnt could maybe be defended, but should not be used if compiling for architecture that does not support it.

More Reading
Older// Streaky locks