Choices and consequences (part 2)
29/Dec 2016
I’ve been doing some minor refactoring recently and - once again - it got me thinking how seemingly tiny C++ changes can affect the generated assembly code (and performance). It was a fairly simple scenario: I had a collection of ~60 small items identified by a unique ID. The ID was a 32-bit number, but realistically the range seemed to be limited to 0-5000 (although I couldn’t rely on it staying like that forever). The old code was organized in a different way and could access elements by indexing the table directly. I had to modify the organization (the ID part), but wanted to keep the O(1) access if possible. The collection was static (built once) and we wouldn’t add new elements too frequently (as in, maybe once a month or so, but possibly not even that).
My first (lazy) solution was to simply go with std::unordered_map. I hoped it’d be good enough, plugged it in and had a quick look at the generated code for the find method and the number of collisions. The latter turned out to be very decent - just 1 collision - but I can’t say I loved the assembly:
; size_type _Bucket = _Hashval(_Keyval);
00007FF794D95C2F mov rdx,0CBF29CE484222325h
00007FF794D95C39 xor r10d,r10d
00007FF794D95C3C movzx eax,byte ptr [r10+r8]
00007FF794D95C41 mov rcx,100000001B3h
00007FF794D95C4B xor rdx,rax
00007FF794D95C4E inc r10
00007FF794D95C51 imul rdx,rcx
00007FF794D95C55 cmp r10,4
00007FF794D95C59 jb 07FF794D95C3Ch
See the 4 muls? These are the FNV iterations. I can’t really blame the STL here - FNV is my go-to default hasher too (and we didn’t provide one) - but in this particular case, avoiding collisions might hurt us more than simply rolling with them. Let’s try a quick’n’dirty hand-crafted open addressing scheme with no hashing at all (i.e. hash function = identity). This is normally a terrible idea, but:
- <100 elements. We can easily afford to make our table at least 4 times as big.
- elements are small (4 bytes). Re-probing shouldn’t hurt as it’s very likely consecutive items are in the cache.
- it’s guaranteed the ID we’re looking for is actually present in the table.
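In C++ terms, the whole thing boils down to something like the sketch below. To be clear, this is a hypothetical reconstruction, not the actual code - the struct/member names and the empty-slot sentinel are mine. It assumes real IDs never equal the sentinel and that looked-up keys are always present:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the table described above (names are mine):
// fixed power-of-2 capacity, identity hash, linear probing,
// insert + find only (no rehash, no erase).
struct IdTable {
    // Key + payload packed into 8 bytes - consistent with the
    // [r9+rcx*8] addressing in the generated code below.
    struct Entry { uint32_t key; uint32_t value; };

    // Sentinel for empty slots; assumes real IDs never use this value.
    static const uint32_t kEmpty = 0xFFFFFFFFu;

    std::vector<Entry> slots;
    uint32_t mask; // capacity - 1; capacity must be a power of 2

    explicit IdTable(uint32_t capacityPow2)
        : slots(capacityPow2, Entry{ kEmpty, 0 }), mask(capacityPow2 - 1) {}

    void insert(uint32_t key, uint32_t value) {
        uint32_t idx = key & mask;      // "hash" = identity
        while (slots[idx].key != kEmpty)
            idx = (idx + 1) & mask;     // linear probing
        slots[idx] = Entry{ key, value };
    }

    // Precondition (as stated above): the key is guaranteed to be present,
    // so the probe loop never has to check for empty slots.
    uint32_t find(uint32_t key) const {
        uint32_t idx = key & mask;
        while (slots[idx].key != key)
            idx = (idx + 1) & mask;
        return slots[idx].value;
    }
};
```

With ~60 elements in, say, a 256-slot table, the load factor stays around 25%, which is what makes linear probing with an identity hash viable in the first place.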
Generated code (inner loop for probing):
1 00007FF7E5C435AE add r11d,r8d
2 00007FF7E5C435B1 mov ecx,r10d
3 00007FF7E5C435B4 and ecx,r11d ; hash-tab mask
4 00007FF7E5C435B7 cmp dword ptr [r9+rcx*8],edx ; key comparison
5 00007FF7E5C435BB je 07FF7E5C435C8h ; found!
6 ; Loop
7 00007FF7E5C435BD inc r8
8 00007FF7E5C435C0 cmp r8,r10
9 00007FF7E5C435C3 jne 07FF7E5C435AEh
[Note: I guess I could have just as well used unordered_map with a custom hasher, but I only needed a tiny subset of a full-blown hash map (no rehash, just insert+find).]
Best case scenario - we break out of the loop at line 5. If we have to re-probe, we try the next iteration. As you can see, the loop is very tight, so having to repeat it shouldn’t hurt. We go against the conventional wisdom in some other aspects as well, e.g. I decided to go with linear probing. Normally, it’s not the best choice as it leads to clustering, but - again - we have a decent idea of what our data set is going to be, so we can verify that. As it turns out, it’s not too bad. We get 5 collisions (vs 1 before), but our longest ‘cluster’ is 2 elements (so in the worst case, we iterate twice). Quadratic probing would require an extra multiplication.
(Side note: one trick I use is a weird variant of linear probing that modifies the starting value in every iteration (r11 in the assembly above). This means our increments are actually 1, 3, 6, 10, 15 etc., not 1, 2, 3... As mentioned - it doesn’t really matter here, as we never re-probe more than once.)
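For the curious, the resulting probe sequence can be sketched as follows (the helper name is mine, purely for illustration): since the step added to the starting value grows by one each iteration, the cumulative offset from the home slot after k iterations is the k-th triangular number.

```cpp
#include <cstdint>

// Sketch of the probing sequence described in the side note (helper name is
// mine): the increment itself grows by 1 every iteration, so the offset from
// the home slot after k iterations is 1 + 2 + ... + k = k * (k + 1) / 2,
// i.e. 1, 3, 6, 10, 15, ...
uint32_t probeOffset(uint32_t k) {
    uint32_t offset = 0;
    for (uint32_t step = 1; step <= k; ++step)
        offset += step; // mirrors the `add r11d,r8d` / `inc r8` pair above
    return offset;
}
```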
An astute reader will notice our hash table size is a power of 2. Again, this is frowned upon in some circles, as prime table sizes tend to give fewer collisions. Using a power of 2 is nice, though, because we can substitute the modulo operation with an AND (modulo = division = pain... that’s one of the few remaining instructions that’s actually really expensive).
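The substitution is a one-liner, and it only works when the size really is a power of 2 (helper name below is mine, for illustration):

```cpp
#include <cstdint>

// For a power-of-2 size N, x % N == x & (N - 1): the low bits of x are
// exactly the remainder. This turns an expensive division into a single
// AND (the `and ecx,r11d` in the listing above).
inline uint32_t tableIndex(uint32_t hash, uint32_t sizePow2) {
    return hash & (sizePow2 - 1);
}
```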
As you can see, we broke at least 3 rules of ‘proper’ hash map use - a terrible hash function, “linear” probing & a power-of-2 table - and yet ended up with very decent perf, beating the textbook solution.
Interestingly, a few days later I ran across 2 posts that cover similar issues:
- Benchmarks vs The World - the author makes a very good point about why the generic solution is done the way it’s done, but we’re in ‘benchmark’ mode here; we don’t need a one-size-fits-all solution
- Ruby 2.4 hashes - interestingly enough, they went from prime-sized tables to power-of-2 as well. I’m actually quite surprised it’s 2016 and we’re only now moving away from chaining.