Compilers are smart

I recently transitioned to Visual Studio 2017 and while it went relatively painlessly, the new and improved optimizer uncovered some subtle issues lurking in the code (to be fair, Clang/GCC have been behaving the same way for a long time now). The code in question was actually quite ancient and originated from this Devmaster forum post (gone now, but still reachable through the Wayback Machine): Fast and accurate sine/cosine. To be more precise, it was this version with ‘fast wrapping’:

x = x * (1.0f / PI);
// Wrap around
float z = (x + 25165824.0f);
x = x - (z - 25165824.0f);
#if LOW_SINE_PRECISION
return 4.0f * (x - x * abs(x));
#else
[..not relevant...]

(Sidenote: for those who don’t remember, the author (Nick) was also the guy behind SwiftShader, a 100% software/CPU-based implementation of OpenGL/DirectX 9, and someone who knows low-level/optimization better than 99.99% of the population.)

Let’s focus on the ‘wrap around’ part. It’s actually a smart trick that relies on the details of the IEEE-754 floating-point format. A float only has 23 bits of mantissa (well, 24 if you count the implied leading one). If we add a huge number (like 25165824, which is 1.5 * 2^24), the sum needs more than 24 bits, so we lose the lowest bit and can only represent even numbers (16777216 would be enough if we only cared about positive numbers). In other words, the addition rounds x + C to an even integer, subtracting C back leaves the even integer nearest to x, and subtracting that from x wraps it into the [-1, 1] range (so for example 1.5 will come out as -0.5, 2.0 will be 0, 1 is still 1, etc.). It’s a neat trick; if you’re interested in the details, just plug in a bunch of numbers and observe. The “problem” here is that the compiler is smart enough to notice that we are adding and subtracting the same number, and with /fp:fast enabled (which is what you want if you’re writing games), we’re allowing it to treat that as a no-op. Now, interestingly enough, the original version actually took this into account and used volatile to stop the optimizer from removing the store/load, i.e.:

volatile float z = (x + 25165824.0f);
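To make the wrap-around easier to poke at, here is a minimal, self-contained sketch of the low-precision path. The names `kBigC`, `wrap_pm1` and `fast_sin_low` are made up for illustration and are not from the original post, and the sketch assumes a build with strict float semantics (no /fp:fast or -ffast-math). The range check is also my addition: the trick only holds while x + C stays between 2^24 and 2^25.

```cpp
#include <cassert>
#include <cmath>

// 1.5 * 2^24. Floats in [2^24, 2^25) are spaced 2 apart, so adding this
// constant rounds the sum to an even integer.
constexpr float kBigC = 25165824.0f;

// Wraps x into [-1, 1] by subtracting the nearest even integer.
// Hypothetical helper; volatile mirrors the original post and forces
// z to be stored and reloaded as a real float.
inline float wrap_pm1(float x)
{
    // Assumption: outside this range x + kBigC leaves [2^24, 2^25)
    // and the spacing of representable values is no longer 2.
    assert(std::fabs(x) < 8388608.0f);
    volatile float z = x + kBigC;
    return x - (z - kBigC);
}

// Low-precision sine: a parabola fitted to sin, evaluated on the
// wrapped argument (input in radians, pre-scaled by 1/PI).
inline float fast_sin_low(float radians)
{
    const float kInvPi = 0.318309886f;
    float x = wrap_pm1(radians * kInvPi);
    return 4.0f * (x - x * std::fabs(x));
}
```

With these definitions `wrap_pm1(1.5f)` gives -0.5f, `wrap_pm1(2.0f)` gives 0.0f and `wrap_pm1(1.0f)` stays 1.0f, matching the values quoted above. `fast_sin_low` is only a rough approximation, so compare it against `std::sin` with a tolerance rather than for equality.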

It seems there were other souls who thought volatile was not necessary and were concerned about perf: https://github.com/OpenImageIO/oiio/issues/963

Compiler Explorer example here (try removing/adding volatile to see the effect). Interestingly, CE’s version of Visual Studio 2015/17 doesn’t optimize it out, but Clang/GCC should. With fast math enabled, all we get is:

foo(float): # @foo(float)
  xorps xmm0, xmm0
  ret

What should happen (VS2015/fast math disabled) is (only relevant fragment):

movss    xmm0, DWORD PTR _x$[esp-4]
cvtps2pd xmm0, xmm0
mulsd    xmm0, QWORD PTR __real@3fd45f306dc9c883 // * (1.0 / PI)
cvtpd2ps xmm2, xmm0
movaps   xmm1, xmm2
addss    xmm1, DWORD PTR __real@4bc00000
subss    xmm1, DWORD PTR __real@4bc00000
subss    xmm2, xmm1

With fast math and volatile (Clang):

addss xmm0, dword ptr [rip + .LCPI0_1]
movss dword ptr [rsp - 4], xmm0
movss xmm1, dword ptr [rsp - 4] # xmm1 = mem[0],zero,zero,zero
subss xmm0, xmm1

With fast math and volatile (VS2017):

addss xmm1,dword ptr [__real@4bc00000 (07FF7D2672410h)]
movss dword ptr [rsp+8],xmm1
movss xmm1,dword ptr [z]
subss xmm1,dword ptr [__real@4bc00000 (07FF7D2672410h)]
subss xmm0,xmm1

You might be a bit concerned about the store/load, but I wouldn’t worry about it too much. We don’t really have to wait for it, as store forwarding should save us here. Purely out of curiosity, we still decided to see if we could persuade the compiler to generate slightly tighter code. One thing we tried was moving volatile from z to the constant:

volatile float BIG_C = 25165824.0f;
[...]
float z = (x + BIG_C);
x = x - (z - BIG_C);

That was enough to fool MSVC:

movaps xmm2,xmm0
addss  xmm2,dword ptr [BIG_C (07FF60EB6A008h)]
movss  xmm1,dword ptr [BIG_C (07FF60EB6A008h)]
subss  xmm2,xmm1
subss  xmm0,xmm2

…but Clang could see right through our ruse. That one caused a bit of head scratching: it actually left some of the code, but removed the bit where we divide by PI (the mulsd in one of the snippets above):

foo(float): # @foo(float)
  movss xmm1, dword ptr [rip + BIG_C] # xmm1 = mem[0],zero,zero,zero
  movss xmm0, dword ptr [rip + BIG_C] # xmm0 = mem[0],zero,zero,zero
  subss xmm0, xmm1

Why? Well, it was smart enough to notice the expression is something like:

x = x - (x + C - C) = (x + C) - (x + C)
…and at this point it doesn’t really matter what x is, so the multiplication by 1/PI that produces it can go as well. We asked it to load C, so it did that (once per volatile read), but only that. Well played, Clang/GCC, well played.
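To see why that simplification is legal under fast math yet destroys the trick, here is a small sketch that evaluates the same expression twice. The function names and `kC` are hypothetical. Double precision is used purely as a stand-in for the real-number algebra the optimizer reasons with: for a value like 1.5 it holds x + C exactly, so nothing gets rounded away.

```cpp
#include <cassert>

constexpr float kC = 25165824.0f;

// What the hardware computes without fast math: x + C is rounded to a
// float, snaps to an even integer, and the result is the wrapped x.
inline float wrap_as_float(float x)
{
    volatile float z = x + kC;
    return x - (z - kC);
}

// Stand-in for exact arithmetic (an assumption for illustration):
// for inputs such as 1.5, a double represents x + C exactly, so the
// expression collapses to zero just like the algebraic identity says.
inline double wrap_as_real(double x)
{
    const double c = static_cast<double>(kC);
    volatile double z = x + c;
    return x - (z - c);
}
```

For x = 1.5 the float version returns -0.5 while the double version returns 0. The rounding in the first one is the whole point of the trick, and it is exactly what fast math gives the compiler permission to ignore.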

It didn’t really matter all that much, as mentioned, it was purely a fun exercise, but I was still quite impressed.
