# Compilers are smart

22 Jul 2018

I recently transitioned to Visual Studio 2017 and while it went relatively painlessly, the new and improved optimizer uncovered some subtle issues lurking in the code (to be fair, Clang/GCC have been behaving the same way for a long time now). The code in question was actually quite ancient and originated from a Devmaster forum post (gone now, but it can be found using the Wayback Machine): Fast and accurate sine/cosine. To be more precise, it was this version with ‘fast wrapping’:

```
x = x * (1.0f / PI);
// Wrap around
float z = (x + 25165824.0f);
x = x - (z - 25165824.0f);
#if LOW_SINE_PRECISION
return 4.0f * (x - x * abs(x));
#else
[..not relevant...]
```

*(sidenote: for those who don’t remember, the author (Nick) was also the guy behind SwiftShader, a 100% software/CPU-based implementation
of OpenGL/DirectX 9, and someone who knows low-level/optimization better than 99.99% of the population)*

Let’s focus on the ‘wrap around’ part. It’s actually a smart trick relying on knowing the details of the IEEE-754 floating-point
format. We only have 23 bits of mantissa (well, 24 if you count the implied leading one). If we add a huge number (like 25165824),
that pushes us past 24 bits and we lose 1 bit, which means we can only represent even numbers (16777216 would be enough if we only
cared about positive numbers). What happens is we wrap around into the [-1, 1] range (so, for example, 1.5 will come out as -0.5,
2.0 will be 0, 1 is still 1, etc.). It’s a neat trick; if you’re interested in the details, just plug in a bunch of numbers and observe.
The “problem” here is that the compiler is smart enough to notice that we are adding and subtracting the same number and, with */fp:fast* enabled (which is what you want if you’re writing games),
we’re allowing it to treat that as a no-op.
Now, interestingly enough, the original version actually took this into account and used *volatile* to stop the optimizer from
removing the store/load, i.e.:

`volatile float z = (x + 25165824.0f);`

Seems like there were more souls who thought *volatile* was not necessary and were concerned about perf:
https://github.com/OpenImageIO/oiio/issues/963

Compiler Explorer example here (try removing/adding *volatile* to see the effect).
Interestingly, CE’s version of Visual Studio 2015/17 doesn’t optimize it out, but Clang/GCC should work. With fast math enabled, all we get is:

```
foo(float): # @foo(float)
        xorps xmm0, xmm0
        ret
```

What *should* happen (VS2015/fast math disabled) is (only relevant fragment):

```
movss xmm0, DWORD PTR _x$[esp-4]
cvtps2pd xmm0, xmm0
mulsd xmm0, QWORD PTR __real@3fd45f306dc9c883 // * (1.0 / PI)
cvtpd2ps xmm2, xmm0
movaps xmm1, xmm2
addss xmm1, DWORD PTR __real@4bc00000
subss xmm1, DWORD PTR __real@4bc00000
subss xmm2, xmm1
```

With fast math and *volatile* (Clang):

```
addss xmm0, dword ptr [rip + .LCPI0_1]
movss dword ptr [rsp - 4], xmm0
movss xmm1, dword ptr [rsp - 4] # xmm1 = mem[0],zero,zero,zero
subss xmm0, xmm1
```

With fast math and *volatile* (VS2017):

```
addss xmm1,dword ptr [__real@4bc00000 (07FF7D2672410h)]
movss dword ptr [rsp+8],xmm1
movss xmm1,dword ptr [z]
subss xmm1,dword ptr [__real@4bc00000 (07FF7D2672410h)]
subss xmm0,xmm1
```

You might be a bit concerned about the store/load, but I wouldn’t worry about it too much; we don’t really have
to wait for it, store forwarding should save us here. Purely out of curiosity, I
still decided to see if I could persuade the compiler to generate slightly tighter code. One thing I tried was
moving *volatile*:

```
volatile float BIG_C = 25165824.0f;
[...]
float z = (x + BIG_C);
x = x - (z - BIG_C);
```

That was enough to fool MSVC:

```
movaps xmm2,xmm0
addss xmm2,dword ptr [BIG_C (07FF60EB6A008h)]
movss xmm1,dword ptr [BIG_C (07FF60EB6A008h)]
subss xmm2,xmm1
subss xmm0,xmm2
```

…but Clang could see right through our ruse. That one caused a bit of head scratching: it actually left some of the code, but removed the bit where we divide by *PI*
(the *mulsd* in one of the snippets above):

```
foo(float): # @foo(float)
        movss xmm1, dword ptr [rip + BIG_C] # xmm1 = mem[0],zero,zero,zero
        movss xmm0, dword ptr [rip + BIG_C] # xmm0 = mem[0],zero,zero,zero
        subss xmm0, xmm1
```

Why? Well, it was smart enough to notice the expression is something like:

`x = x - ((x + C) - C)`

With fast math it’s free to reassociate that into `(x - x) + (C - C)`, which means the result doesn’t actually depend on what *x* is. We asked to load *C* (it’s *volatile*, after all), so it did that, but *only* that. Well played, Clang/GCC, well played.

It didn’t really matter all that much; as mentioned, it was purely a fun exercise, but I was still quite impressed.