Not all dot products are equal

I had another interesting debugging session recently with quite an unexpected conclusion. It all started when we received a new crash report - a certain platform (and 1 platform only) was crashing in a fairly old and seemingly innocent fragment of code:

someModule->GetElements(elements);
std::sort(elements.begin(), elements.end(),
    [&origin](const Element* a, const Element* b)
    {
        return a->pos.DistSqr(origin) < b->pos.DistSqr(origin);
    });

Crash was fairly rare although quite consistent in a particular location in the game, release build only, crashing on access violation inside the predicate. At the first glance it seemed like one of the elements was null. As mentioned, code was quite old, but I recently modified ‘someModule’ (used to get the elements), so it landed on my plate. I spent some time looking at my changes, but couldn’t really see how they’d be implicated. I couldn’t really repro it myself, but had a crash dump with some info and was DM-ing with a person who actually had in the debugger (a release build, but better than nothing). I also noticed that what I had originally assumed to be a- null pointer, wasn’t really null. It was some small value, but not zero (taking offset of ‘pos’ member var into account of course, but it still didn’t match). The actual code that was crashing was deep in the guts of STL sort, written in the usual STL manner, line looked like:

for (_BidIt _Prev = _Hole;_DEBUG_LT_PRED(_Pred, _Val, *--_Prev);_Hole = _Prev)

Cross-referencing assembly it quickly became clear it was trying to access Prev[-1] (ie. an element ‘before’ Prev). Pointer itself looked fine, but the value loaded from that address was bogus (as said, ‘close to null, but not quite’). In theory this should be a part of our elements array, but as far as we could tell, the array was fine (it was quite big, so was not easy to eyeball all the elements, but no obvious problems). Looking ‘around’ the array however, I noticed that the invalid value just before the first element. Inspecting the value of Prev confirmed it was in fact outside the array bounds… We somehow iterated ‘too far backwards’! If you worked with STL sort enough you can probably guess where this is going… and I guess I’ve been burned enough times that I recognized the symptoms quite quickly. Let’s consult the official documentation for std::sort. The fragment we’re interested in is this: Compare must meet the requirements of Compare (Compare being the type of our predicate). Mhm, I guess this sounds super redundant, but the important bit in the linked page is this:

The return value of the function call operation applied to an object of a type satisfying Compare, when contextually converted to bool, yields true if the first argument of the call appears before the second in the strict weak ordering relation induced by this type, and false otherwise.

The TLDR on the strict weak ordering is:

  • For all a, comp(a, a) == false
  • If comp(a, b) == true then comp(b, a) == false
  • if comp(a, b) == true and comp(b, c) == true then comp(a, c) == true

If these requirements are violated… bad things happen. To be honest, I have not seen this out-of-bounds access before, but I did run into ‘sort just gets in a never-ending loop’ problem. Typically debug builds will verify it for you (that is what DEBUG_LT_PRED is doing), but as I said – debug builds were fine for us. OK, so we suspect what is the problem, but the questions is ‘why?’ and ‘why is it new?’. An obvious suspect is DistSqr which is basically just a subtraction+dot product and if you think it should not be done for every comparison, you’re right, but the second question remained. At this point we remembered that the dot product function has actually been ‘modernized’ fairly recently to make use of horizontal adds. We couldn’t prove it yet, but our handwavy theory was that ‘something’ in that change made comparison less reliable. I attempted a speculative fix, computed distances once and used them for sorting (so we were no longer calculating Dot products inside the compare function). Our reliable repro case stopped crashing, so… success? I guess so, but I needed closure. PC assembly looked innocent, but the AVX2 asm uncovered something interesting. Here is a fragment of the sort function + corresponding assembly (AVX2+/fp:fast):

if (_DEBUG_LT_PRED(_Pred, _Val, *_First))
{   // found new earliest element, move to front
    _Move_backward_unchecked(_First, _Mid, ++_Hole);
    *_First = _STD move(_Val);
} else { // look for insertion point after first
    for (_BidIt _Prev = _Hole;_DEBUG_LT_PRED(_Pred, _Val, *--_Prev);_Hole = _Prev)

Asm (fragment, simplified, just dot product, not dist sqr):

; XMM3 = dot product for _Val
vmovups xmm6, XMMWORD PTR [r8]
vmulps  xmm1, xmm6, XMMWORD PTR [r10]
vmovups xmm4, XMMWORD PTR [rax]
vmulps  xmm0, xmm6, xmm4
vmovshdup xmm3, xmm0
vmovhlps xmm0, xmm0, xmm0
vfmadd231ss xmm3, xmm6, xmm4
vaddss  xmm3, xmm3, xmm0
; [...]
; XMM2 = dot product for Prev[-1]
lea     rcx, QWORD PTR [rax-16]
vmovups xmm5, XMMWORD PTR [rcx]
vmulps  xmm1, xmm5, xmm6
vmovshdup xmm0, xmm1
vaddss  xmm2, xmm1, xmm0
vmovhlps xmm0, xmm1, xmm1
vaddss  xmm2, xmm2, xmm0
vcomiss xmm3, xmm2
jae     SHORT $LN6@Test

As you can see compiler is smart enough to try and re-use the value of dot product for _Val (it stores it in XMM3) and use it again for the comparison against *–Prev. The problem is… these 2 blocks are not exactly the same. The first block uses fused multiply-add (vfmadd231ss), the second one - more traditional mul/add combo. It is not the only place too, I didn’t inspect the entire sort function, but many other subroutines had the same problem and were inconsistent. 99.9% of the time, it works ‘fine’, but since the rounding behavior for fmadd & mul/add is different, it is possible we will get slightly different results for same element.
Strictly speaking, we can’t even fully blame the compiler here, we asked for it (fastmath). To me, it is a bit weird it is not generating the same code, but I didn’t dig deep enough to prove it is ‘wrong’. My initial gut feeling was fmadd was a bit redundant, we already have squared the value, why do it again, but MCA (Machine Code Analyzer) doesn’t think it is a big deal.
I brought this up on Mastodon and Tom Forsyth made a very good point – technically even without fmadd we could be doing (X^2+Y^2) + Z^2 or X^2 + (Y^2+Z^2) or any other combination. It has never been a problem for us before fmadd, but yes, it is another potential trap.

One could argue it was always a problem we were re-calculating distances inside the comparison function and that is true too, but it is one of these things that worked fine… until it didn’t.