Smartness overload #2
23/Oct 2013
Today’s article is brought to you by a friend of mine. He’s been doing some home experiments and noticed a particular piece of code was running faster in Debug than Release (using Visual C++ compiler). He mentioned this to me, I found this intriguing enough to ask him for this snippet and investigated it yesterday. Turned out it was a classical case of compiler trying to be too smart and shooting itself in the foot. Consider the following piece of code (it’s not exactly the same as the one I’ve been analyzing, I stripped it to a bare minimum), built for x86:
1// numToTest & lc are __int64s
2for(__int64 i = 2; i < numToTest; ++i)
3{
4 if(numToTest % i == 0)
5 {
6 lc = numToTest / i;
7 sd += i;
8 //.. some other stuff, not too much, though
As you can see, it’s fairly innocent, inner loop is tight (compiles to maybe 15-20 instructions) and doesn’t touch memory too much, so I could see Debug running at roughly the same speed as Release, but being faster? That’s something you do not see very often. It was consistently faster too and by a fair margin.
Let’s see generated assembly code (loop body). Debug:
1013C167C 8B 45 D8 mov eax,dword ptr [ebp-28h]
2013C167F 50 push eax
3013C1680 8B 4D D4 mov ecx,dword ptr [i]
4013C1683 51 push ecx
5013C1684 8B 55 0C mov edx,dword ptr [ebp+0Ch]
6013C1687 52 push edx
7013C1688 8B 45 08 mov eax,dword ptr [numToTest]
8013C168B 50 push eax
9013C168C E8 D8 F9 FF FF call @ILT+100(__allrem) (13C1069h)
10013C1691 89 85 08 FF FF FF mov dword ptr [ebp-0F8h],eax
11013C1697 89 95 0C FF FF FF mov dword ptr [ebp-0F4h],edx
12013C169D 8B 8D 08 FF FF FF mov ecx,dword ptr [ebp-0F8h]
13013C16A3 0B 8D 0C FF FF FF or ecx,dword ptr [ebp-0F4h]
14013C16A9 75 2D jne sd_Local+0C8h (13C16D8h)
15{
16lc = numToTest / i;
17013C16AB 8B 45 D8 mov eax,dword ptr [ebp-28h]
18013C16AE 50 push eax
19013C16AF 8B 4D D4 mov ecx,dword ptr [i]
20013C16B2 51 push ecx
21013C16B3 8B 55 0C mov edx,dword ptr [ebp+0Ch]
22013C16B6 52 push edx
23013C16B7 8B 45 08 mov eax,dword ptr [numToTest]
24013C16BA 50 push eax
25013C16BB E8 BD F9 FF FF call @ILT+120(__alldiv) (13C107Dh)
26013C16C0 89 45 E4 mov dword ptr [lc],eax
27013C16C3 89 55 E8 mov dword ptr [ebp-18h],edx
Release:
100291038 56 push esi
200291039 57 push edi
30029103A 55 push ebp
40029103B 50 push eax
50029103C E8 CF 08 00 00 call _alldvrm (291910h)
600291041 0B CB or ecx,ebx
700291043 75 10 jne sd_Local+55h (291055h)
8{
9lc = numToTest / i;
10sd += i;
1100291045 01 7C 24 10 add dword ptr [esp+10h],edi
1200291049 89 44 24 18 mov dword ptr [esp+18h],eax
130029104D 89 54 24 1C mov dword ptr [esp+1Ch],edx
1400291051 11 74 24 14 adc dword ptr [esp+14h],esi
At first glance, it doesn’t explain much, if anything, Release is tighter and keeps data in registers (as expected). It only calls 1 function vs 2 in Debug, too… The interesting part is – it calls a different function. As you probably know, 64-bit operations are a little bit tricky in 32-bit applications. Numbers take 2 registers and we need to deal with overflows properly. That’s why MSVC encapsulates 64-bit divisions in functions (as opposed to just spitting out 40+ assembly instructions every time). As it turns out, it has at least 3 of these (+ variants for unsigned types):
lldiv - 64-bit divide (x/y),
llrem - 64-bit remainder (x % y),
lldvrm - both (x/y AND x % y)
It does make sense to calculate both divide & remainder at once sometimes as it’s not that much additional work. That’s what the compiler tried to do here. It was smart enough to notice we use both numToTest % i and numToTest / i, so generated a call to lldvrm to get them both in one shot. The problem is, lldvrm is also a little bit more expensive than llrem (not much, but it adds up), see reference implementations here & here. If you think about it, our if(numToTest % i == 0) test will evaluate to false in the vast majority of cases. This means that Debug build will get away with cheaper llrem call and only has to execute lldiv in few situations. Release buid OTOH will pay for calculating both x/y & x%y, even though it only uses the latter. That’s enough to nullify the effects of better register management and not having to call lldiv in a handful of cases.
Other observations:
obviously, problem doesn’t manifest itself in x64 builds, no need for function calls there (both builds run much faster than 32-bit version, Release faster than Debug),
I couldn’t find an easy way to persuade the compiler to stop using lldvrm. MSVC doesn’t have branching hints, I tried PGO, hoping it’ll realize that if evaluates to false in 95% of cases, but it didn’t help either. Doing:
1__declspec(noinline) __int64 Rem64(__int64 x, __int64 y) 2{ 3 return(x % y); 4}
..and using Rem64(numToTest, i) instead of numToTest % i did help, but it’s not ideal, as you pay the price of additional function call (has to be there, though, so that compiler doesn’t notice the optimization “opportunity”).
Old comments
Ofek 2013-10-26 07:54:39
Nice investigation! I didn’t know about these function calls. Thanks for sharing this.