Know your assembly (part N+1)

Recently we have updated some of our compilers (Clang) and started running into mysterious problems, mostly in the physics code. We were able to narrow it down to raycasts failing to hit… sometimes. What made it a bit more annoying - it was only happening on 1 platform I had no easy access too. It was ARM based, but another ARM platform was fine. It seemed like the problem was deep in the guts of PhysX, code is publicly available, so I can link it here (rayAABBIntersect2). We played a bit with compiler flags, blamed fastmath for a bit (as one does), found out it was innocent here, but couldn’t narrow it down easily. It would happen at O2/O3, not at O1, but generated code changed so much it was hard to find the culprit. On top of all that, it was ARM, as mentioned and while I can (mostly) read it, I have never written anything bigger and I was not as comfortable as with x64 (spoiler alert: if only I tried looking at Clang x64…) I tried some remote debugging, but it was a bit too slow to step through all the code this way. If you look at the source, you will probably notice the 2000s era zero-checks:

 1const PxReal* dir = &_dir.x;
 2const PxU32* idir = reinterpret_cast<const PxU32*>(dir);
 3[...]
 4if(idir[i])
 5//if(PX_IR(dir[i]))
 6	MaxT[i] = (maximum[i] - origin[i]) / dir[i];
 7[...]
 8	const PxU32* tmp = reinterpret_cast<const PxU32*>(&MaxT[WhichPlane]);
 9	if((*tmp)&PX_SIGN_BITMASK)
10//	if(PX_IR(MaxT[WhichPlane])&PX_SIGN_BITMASK)
11		return 0;

These are no longer necessary these days and C++ language lawyer will tell you it is a UB… but also this code has been around for 20+ years and worked fine on multiple platforms. Still, we replaced these with floating-point comparisons and the bug… went away. After some more experiments we discovered it was actually enough to change the ‘tmp’ part. I stared at the code more, but could not see how it was related, seemed more like some kind of a snowball effect. Finally decided to just play a ‘human debugger’ and step through the code on paper. Refer to this Godbolt snippet (after I knew what the problem was I found out I could repro it in Godbolt). Clang 16 is broken, Clang 17 is fine. Feel free to stop reading and look at the code if you would like to give it a try.

We first calculate 3 coords of MaxT either by doing: MaxT[i] = (minimum[i] - origin[i]) / dir[i] or MaxT[i] = (maximum[i] - origin[i]) / dir[i]. Compiler unrolled this 3 iteration loop, but if you look closely you will notice one of these is not like the others:

 1fsub    s0, s0, s1
 2fmov    s1, w8
 3fdiv    s0, s0, s1 // MaxT[0]
 4str     s0, [sp] 
 5[...]
 6fsub    s1, s1, s2
 7ldr     s2, [x3, #4]
 8fdiv    s1, s1, s2 // MaxT[1]
 9str     s1, [sp, #4] 
10[...]
11ldr     s3, [x2, #8]
12fsub    s2, s2, s3
13ldr     s3, [x3, #8]
14fdiv    s2, s2, s3 // MaxT[2]
15b       .LBB0_23

At first, I assumed we stored MaxT in s0/s1/s2 so didn’t think twice of the fact we didn’t store it on the stack for the Z coord. Let’s see what happens later, though. More specifically, this is the fragment when we try to pick the ‘greatest’ coordinate:

1PxU32 WhichPlane = 0;
2if(MaxT[1] > MaxT[WhichPlane])	WhichPlane = 1;
3if(MaxT[2] > MaxT[WhichPlane])	WhichPlane = 2;
4const PxU32* tmp = reinterpret_cast<const PxU32*>(&MaxT[WhichPlane]);
5if((*tmp)&PX_SIGN_BITMASK)

Corresponding assembly:

 1fcmp    s1, s0
 2mov     x10, sp
 3cset    w9, gt
 4orr     x11, x10, x9, lsl #2
 5ldr     s3, [x11]  // (1)
 6mov     w11, #2
 7fcmp    s2, s3
 8csel    x9, x9, x11, le
 9lsl     x11, x9, #2
10ldr     x10, [x10, x11] // (2) ***
11tbnz    w10, #31, .LBB0_31

(The actual code generated in our case was actually a bit different, bfi instead of orr and final lsl ‘folded’ with the ldr via utw, but the gist is the same) Don’t worry too much about individual instructions, asm code pretty much follows C++, it compares x/y first, then the ‘winner’ against Z, stores WhichPlane in x9. Two things to note: it uses registers (s0/s1/s2) for most comparisons, but hits the stack as well, ‘indexing’ it with coord (x10=sp). Marked both locations with (1) & (2). The first one is safe as we can only access X or Y (X9 is 0/1) and as we established before, they are both on the stack. The second one however.. if Z (S2) is greater, we will try to grab it from the stack… but it was never stored there. At this point, stack contains the original value (constructor initialization), ie. -1.0. This also means our sign test will return true and we will always early out from the function without hitting anything. Here is where things get interesting – if we go with ordinary floating-point comparison: if(MaxT[WhichPlane]<0.0f), it changes slightly how Clang allocates registers, but it also stores Z on the stack now:

1fsub    s2, s2, s3
2ldr     s3, [x3, #8]
3fdiv    s2, s2, s3
4str     s2, [sp, #8]
5b       .LBB0_21

I can’t really explain this, it feels like we hit a different code generation path now. Once we know what the problem is, it was easy to run some more tests in Godbolt:

  • it seems like a universal Clang bug, not tied to any architecture, X64 has the same problem (I wish I knew…). Included in the Godbolt link
  • Godbolt doesn’t have super fine-grained Clang versions. It seems like it is fine in 15, broken in 16 and fine in 17.. but based on our tests, it is more likely it broke in some minor update of 15 and was fixed i some minor update of 16. Most notably, current XCode version seems broken.
  • a more ‘scientific’ fix is disabling loop unrolling (-fno-unroll-loops), but it might have a minor perf penalty

The good news is, I can now reliably verify whether the code is fine, the bad news is, it is impossibly to eyeball assembly code for the entire project.

More Reading