Smartness overload #2

Today’s article is brought to you by a friend of mine. He’s been doing some home experiments and noticed a particular piece of code was running faster in Debug than Release (using Visual C++ compiler). He mentioned this to me, I found this intriguing enough to ask him for this snippet and investigated it yesterday. Turned out it was a classical case of compiler trying to be too smart and shooting itself in the foot. Consider the following piece of code (it’s not exactly the same as the one I’ve been analyzing, I stripped it to a bare minimum), built for x86:

1// numToTest & lc are  __int64s
2for(__int64 i = 2; i < numToTest; ++i)
4    if(numToTest % i == 0)
5    {
6        lc = numToTest / i;
7        sd += i;
8        //.. some other stuff, not too much, though

As you can see, it’s fairly innocent, inner loop is tight (compiles to maybe 15-20 instructions) and doesn’t touch memory too much, so I could see Debug running at roughly the same speed as Release, but being faster? That’s something you do not see very often. It was consistently faster too and by a fair margin.

Let’s see generated assembly code (loop body). Debug:

 1013C167C 8B 45 D8         mov         eax,dword ptr [ebp-28h]
 2013C167F 50               push        eax
 3013C1680 8B 4D D4         mov         ecx,dword ptr [i]
 4013C1683 51               push        ecx
 5013C1684 8B 55 0C         mov         edx,dword ptr [ebp+0Ch]
 6013C1687 52               push        edx
 7013C1688 8B 45 08         mov         eax,dword ptr [numToTest]
 8013C168B 50               push        eax
 9013C168C E8 D8 F9 FF FF   call        @ILT+100(__allrem) (13C1069h)
10013C1691 89 85 08 FF FF FF mov         dword ptr [ebp-0F8h],eax
11013C1697 89 95 0C FF FF FF mov         dword ptr [ebp-0F4h],edx
12013C169D 8B 8D 08 FF FF FF mov         ecx,dword ptr [ebp-0F8h]
13013C16A3 0B 8D 0C FF FF FF or          ecx,dword ptr [ebp-0F4h]
14013C16A9 75 2D            jne         sd_Local+0C8h (13C16D8h)
16lc = numToTest / i;
17013C16AB 8B 45 D8         mov         eax,dword ptr [ebp-28h]
18013C16AE 50               push        eax
19013C16AF 8B 4D D4         mov         ecx,dword ptr [i]
20013C16B2 51               push        ecx
21013C16B3 8B 55 0C         mov         edx,dword ptr [ebp+0Ch]
22013C16B6 52               push        edx
23013C16B7 8B 45 08         mov         eax,dword ptr [numToTest]
24013C16BA 50               push        eax
25013C16BB E8 BD F9 FF FF   call        @ILT+120(__alldiv) (13C107Dh)
26013C16C0 89 45 E4         mov         dword ptr [lc],eax
27013C16C3 89 55 E8         mov         dword ptr [ebp-18h],edx


 100291038 56               push        esi
 200291039 57               push        edi
 30029103A 55               push        ebp
 40029103B 50               push        eax
 50029103C E8 CF 08 00 00   call        _alldvrm (291910h)
 600291041 0B CB            or          ecx,ebx
 700291043 75 10            jne         sd_Local+55h (291055h)
 9lc = numToTest / i;
10sd += i;
1100291045 01 7C 24 10      add         dword ptr [esp+10h],edi
1200291049 89 44 24 18      mov         dword ptr [esp+18h],eax
130029104D 89 54 24 1C      mov         dword ptr [esp+1Ch],edx
1400291051 11 74 24 14      adc         dword ptr [esp+14h],esi

At first glance, it doesn’t explain much, if anything, Release is tighter and keeps data in registers (as expected). It only calls 1 function vs 2 in Debug, too… The interesting part is – it calls a different function. As you probably know, 64-bit operations are a little bit tricky in 32-bit applications. Numbers take 2 registers and we need to deal with overflows properly. That’s why MSVC encapsulates 64-bit divisions in functions (as opposed to just spitting out 40+ assembly instructions every time). As it turns out, it has at least 3 of these (+ variants for unsigned types):

  • lldiv - 64-bit divide (x/y),

  • llrem - 64-bit remainder (x % y),

  • lldvrm - both (x/y AND x % y)

It does make sense to calculate both divide & remainder at once sometimes as it’s not that much additional work. That’s what the compiler tried to do here. It was smart enough to notice we use both numToTest % i and numToTest / i, so generated a call to lldvrm to get them both in one shot. The problem is, lldvrm is also a little bit more expensive than llrem (not much, but it adds up), see reference implementations here & here. If you think about it, our if(numToTest % i == 0) test will evaluate to false in the vast majority of cases. This means that Debug build will get away with cheaper llrem call and only has to execute lldiv in few situations. Release buid OTOH will pay for calculating both x/y & x%y, even though it only uses the latter. That’s enough to nullify the effects of better register management and not having to call lldiv in a handful of cases.

Other observations:

  • obviously, problem doesn’t manifest itself in x64 builds, no need for function calls there (both builds run much faster than 32-bit version, Release faster than Debug),

  • I couldn’t find an easy way to persuade the compiler to stop using lldvrm. MSVC doesn’t have branching hints, I tried PGO, hoping it’ll realize that if evaluates to false in 95% of cases, but it didn’t help either. Doing:

    1__declspec(noinline) __int64 Rem64(__int64 x, __int64 y)
    3    return(x % y);

..and using Rem64(numToTest, i) instead of numToTest % i did help, but it’s not ideal, as you pay the price of additional function call (has to be there, though, so that compiler doesn’t notice the optimization “opportunity”).

Old comments

Ofek 2013-10-26 07:54:39

Nice investigation! I didn’t know about these function calls. Thanks for sharing this.