Choices and consequences (part 2)

I’ve been doing some minor refactoring recently and - once again - it got my thinking how seemingly tiny C++ changes can affect generated assembly code (and performance). It was a fairly simple scenario, had a collection of ~60 small items identified by an unique ID. ID was a 32-bit number, but realistically the range seemed to be limited to 0-5000 (although I couldn’t rely on it staying like that forever). The old code was organized in a different way and could access elements by indexing the table directly. I had to modify the organization (ID part), but wanted to keep the O(1) access if possible. Collection was static (built once) and we wouldn’t add new elements too frequently (as in, maybe once a month or so, but possibly not even that)

My first (lazy) solution was to simply go with std::unordered_map. I hoped it’d be good enough, plugged it in and had a quick look at the generated code for the find method and number of collisions. The latter turned out to be very decent, just 1 collision, but can’t say I loved the assembly:

; size_type _Bucket = _Hashval(_Keyval);
00007FF794D95C2F  mov         rdx,0CBF29CE484222325h  
00007FF794D95C39  xor         r10d,r10d  
00007FF794D95C3C  movzx       eax,byte ptr [r10+r8]  
00007FF794D95C41  mov         rcx,100000001B3h  
00007FF794D95C4B  xor         rdx,rax  
00007FF794D95C4E  inc         r10  
00007FF794D95C51  imul        rdx,rcx  
00007FF794D95C55  cmp         r10,4  
00007FF794D95C59  jb          07FF794D95C3Ch

See 4 muls? These are the FNV iterations. Can’t really blame STL here, FNV is my go-to default hasher too (and we didn’t provide one), but in this particular case, it might hurt us more to avoid conflicts than to actually roll with them. Let’s try the quick’n’dirty hand-crafted open addressing with no hashing at all (ie. hash function = identity). This is normally a terrible idea, but:

  • <100 elements. We can easily afford to make our table at least 4 times as big
  • elements are small (4 bytes). Re-probing shouldn’t hurt as it’s very likely consecutive items are in cache.
  • it’s guaranteed the id we’re looking for is actually present in the table

Generated code (inner loop for probing):

1 00007FF7E5C435AE  add         r11d,r8d  
2 00007FF7E5C435B1  mov         ecx,r10d  
3 00007FF7E5C435B4  and         ecx,r11d  ; hash-tab mask
4 00007FF7E5C435B7  cmp         dword ptr [r9+rcx*8],edx  ; key comparison
5 00007FF7E5C435BB  je          07FF7E5C435C8h ; found!
6 ; Loop
7 00007FF7E5C435BD  inc         r8  
8 00007FF7E5C435C0  cmp         r8,r10  
9 00007FF7E5C435C3  jne         07FF7E5C435AEh

[Note: I guess I could have as well used unordered_map with a custom hasher, but I only needed a tiny subset of a full-blown hash_map (no rehash, just insert+find)] Best case scenario - we break out of the loop in line 5. If we have to re-probe, we’ll try the next iteration. As you can see the loop is very tight, so having to repeat it shouldn’t hurt. We go against the conventional wisdom in some other aspects as well, e.g. I decided to go with linear probing. Normally, it’s not the best choice as it leads to clustering, but - again - we have a decent idea what’s our data set is going to be, so we can verify that. As it turns out, it’s not too bad. We get 5 collisions (vs 1 before), but our longest ‘cluster’ is 2 elements (so in the worst case, we iterate twice). Quadratic probing would require an extra multplication.
(Side note: one trick I use is a weird variant of linear probing that modifies the starting value in every iteration (r11 in the assembly above. This means our increments are actually 1, 3, 6, 10, 15 etc, not 1,2,3.. As mentioned - doesn’t really matter here, as we never reprobe more than once)

Astute reader will notice our hash table size is power-of-2. Again, this is frowned upon in some circles as prime tab sizes tend to give less collisions. Using power-of-2 is nice though because we can substitute modulo operation with an AND (modulo = division = pain.. that’s one of the few remaining instructions that’s actually really expensive).

As you can see, we broke at least 3 rules of ‘proper’ hash_map use: terrible hash function, “linear” probing & power-of-2 table and yet ended up with a very decent perf, beating the textbook solution.

Interestingly, few days later I discovered 2 interesting posts that cover similar isses:

  • Benchmarks vs The World - author makes very good point why the generic solution should be done the way it’s done, but we’re in ‘benchmark’ mode here, we don’t need a one size fits all solution
  • Ruby 2.4 hashes - interestingly enough, they went from prime size tables to power-of-2 as well. I’m actually quite surprised it’s 2016 and we’re only now going away from chaining.