cmov fun

If you’ve been coding for current (prev?) gen consoles, you know your optimization guidelines - be nice to your cache, avoid LHS and branches, in general - do not stress the pipeline too much. With next (current?) generation moving back to x86, things get a little bit more blurry. With out-of-order/speculative execution, register renaming and advanced branch predictors, it’s sometimes easy to shoot yourself in the foot when trying to be smarter than compiler/CPU. I have been browsing Intel’s “64 and IA-32 Architectures Optimization Reference Manual” recently and found an interesting bit on the CMOV (conditional move) instruction - “Use the SETCC and CMOV instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. […] converting a conditional branch to SETCC or CMOV trades off control flow dependence for data dependence and restricts the capability of the out-of-order engine”. It intrigued me enough to prepare a quick test snippet. It includes 4 functions:

  • TestPredictable - predictable branches, conditional jump,

  • TestPredictableCmov - predictable branches, CMOV,

  • TestUnpredictable - unpredictable branches, conditional jump,

  • TestUnpredictableCmov - unpredictable branches, CMOV

Starting point was a very simple code, they’re all variation of the following:

 1 //int TestPredictable(int x, int y)
 2 int result = 0;
 3 for (size_t i = 0; i < TEST_SIZE; ++i)
 4 {
 5     if (x < y)
 6     {
 7         result = x;
 8     }
 9     ++x;
10 }

One way to generate CMOV/non-CMOV versions is to modify code generation setting (/arch:SSE2 for CMOV), but I wanted to be able to run both functions without recompiling, so I had to resort to assembly. Caller code:

r = TestPredictable(TEST_SIZE - 50, TEST_SIZE);
r = TestPredictableCmov(TEST_SIZE - 50, TEST_SIZE);

As the name suggests, branches should be easy to predict, first 50 won’t be taken, the rest - taken.

Results were… inconclusive. Both versions were usually pretty close (~200-300 CPU ticks apart per 10000 iterations, CMOV faster most of the time, Ivy Bridge CPU). So, even though branch is very predictable, CMOV doesn’t seem to make things worse (it doesn’t make it better, either). That’s to be expected, though, there’s no real data dependency here, we’ll try more complicated example later.

Let’s move to the unpredictable branch test. Code is quite similar, but the loop body looks like this:

if (x < (Random() & 1023))
    result = x;
// Caller code (roughly 50% chance of passing the test)
r = TestUnpredictable(500);
r = TestUnpredictableCmov(500);

In this case, difference is much more visible. CMOV version is consistently faster by about 20% (1.5m ticks vs 1.8m ticks per 10000 iterations).

Let’s now go back to the first example and try to wreak some havoc, make it more difficult for the CMOV. C++ code looks like this now:

 1 size_t gi = TEST_SIZE - 1;
 2 for (size_t i = 0; i < TEST_SIZE; ++i)
 3 {
 4     if (x < y)
 5     {
 6         result = gtab[gi];
 7         // for the CMOV version it'll result in something like:
 8         // cmovl ebx, dword ptr [edx] (edx being decremented in every iteration)
 9     }
10     ++x;
11     --gi;
12 }

We introduce data dependency on gtab and also traverse it backwards to make it more interesting (not very realistic scenario, I admit). Results? CMOV version is now over 2 times slower than the one with conditional jump (~22k ticks vs ~50k ticks)!

You might also want to read this discussion on the Linux mailing group. Nanhai Zou raises an interesting point there – you have better chance of seeing benefits of cmov if you can manage to put instructions which do not modify eflags between test and cmov. I think that’s what MSVC was trying to do the other day (sadly, it wasn’t careful enough).

Encore: poor MSVC can’t get a break, I noticed another peculiar behavior. In some circumstances (can’t really pinpoint the root cause), it’d completely screw up my timing code. Consider this simple fragment:

unsigned __int64 ticks = __rdtsc();
int r = TestPredictable(TEST_SIZE - 50, TEST_SIZE);
ticks = __rdtsc() - ticks;

Now, if TestPredictable includes inline assembly code, it works fine:

mov         esi,eax
mov         edi,edx
push        2710h
push        26DEh
call        `anonymous namespace'::TestPredictable (0F77530h)
mov         ecx,eax
sub         eax,esi
sbb         edx,edi

However, if I use my reference C++ code inside TestPredictable, it decides to do the following (no WPO):

mov         ecx,eax
mov         ebx,edx
mov         esi,eax
mov         edi,edx
sub         esi,ecx
sbb         edi,ebx
push        2710h
push        26DEh
call        `anonymous namespace'::TestPredictable (12B7530h)

Yup, it moved the second RDTSC up, before the call to TestPredictable.

Test code can be found here.

Old comments

Lachlan Stuart 2013-07-20 11:05:50

Testing on my Nehalem (i7-920) compiled in MSVC2010SP1 (manually checked to ensure the rdtsc reorder didn’t occur):
Ticks = 19076409 [fake result=0]
Ticks cmov = 30448753 [fake result=0]
Ticks unpredictable = 749622642 [fake result=500]
Ticks unpredictable cmov = 747278534 [fake result=500]
I had to bump the TEST_SIZE to 10000000 to get consistent results. Looks like CMOV is always equal or slower than branching on my CPU.
I’m not happy with x86’s conditional writes. CMOV and CMPXCHG(if you don’t depend on the returned value) should be faster than branching - they require fewer instructions and shouldn’t even engage the prediction engine - they are basically “fire and forget” instructions.
I’d love to see CPUs that come with GPU-style ROPs, although I may be a bit biased after having written a few software renderers as hobby projects.

admin 2013-07-22 00:11:31

Thanks for sharing your results. I agree that x86 seems a little bit schizophrenic in this regard, on one hand they introduce new instructions like cmov, on the other - they keep improving branch predictors. Both sound like good moves, but they sometimes work against each other.

More Reading
Older// Parallel 101
comments powered by Disqus