Choices & consequences

Spent a little bit time tweaking RDE vector class again. As I already mentioned, the container itself is not terribly fascinating, there are not too many choices here. There’s another battle going on the lower level though, it’s interesting to see how an innocent instruction like_ size()_ can be a cause of a slowdown. Typically, a vector class has 3 properties it needs to keep track of:

  • buffer properties (pointer and size),

  • number of stored elements

We can safely ignore buffer begin property - it has to be a pointer. Capacity and number of stored elements are more interesting, though. We can either store them using an integer number or as another pointer marking the end of range. What are the consequences? In the first case all the operations requiring “end” pointer will be more costly (we don’t know this pointer offhand, need to add our integer to the begin pointer). If we decide to go with the second solution – size() and/or capacity() methods will have to convert a pair of pointers into number of elements. (Third approach would be to store both, but this means additional bookkeeping, so it’s rarely used). Most implementations I’ve seen use the latter approach (pointers), as we tend to operate on array “end” quite often. The consequence is (as mentioned) – size() method now requires conversion from pointer to integer, which usually means an integer division. Consider this seemingly innocent code:

 1struct Foo
 2{
 3    char ch[13];
 4};
 5void Test(const rde::vector<Foo>& v)
 6{
 7    for(std::size_t i = 0; i < v.size(); ++i)
 8    {
 9        printf("%d\n", v[i].ch[0]);
10    }
11}

Here’s generated code (inner loop only):

 100DEE420 0F B6 04 0F      movzx       eax,byte ptr [edi+ecx]
 200DEE424 50               push        eax
 300DEE425 68 6C 3D E1 00   push        0E13D6Ch
 400DEE42A FF 15 10 58 E4 00 call        dword ptr [__imp__printf (0E45810h)]
 500DEE430 8B 46 04         mov         eax,dword ptr [esi+4]
 600DEE433 59               pop         ecx
 700DEE434 59               pop         ecx
 800DEE435 8B 0E            mov         ecx,dword ptr [esi]
 900DEE437 2B C1            sub         eax,ecx
1000DEE439 6A 0D            push        0Dh
1100DEE43B 99               cdq
1200DEE43C 5D               pop         ebp
1300DEE43D F7 FD            idiv        eax,ebp ***
1400DEE43F 43               inc         ebx
1500DEE440 83 C7 0D         add         edi,0Dh
1600DEE443 3B D8            cmp         ebx,eax
1700DEE445 72 D9            jb          `anonymous namespace'::Test+1Dh (0DEE420h)

Yuck! To be fair, compiler is sometimes smart enough to recognize that size() has no side effects and vector cannot be modified during iteration, so it moves this calculation outside the loop. Couldn’t do it in this case as vector is not local. Observations:

  • for iteration prefer iterators (doh) to indices (especially if they’re raw pointers underneath, as they should… Even better on platforms with load/store+update instructions).

  • if you know array is not modified in the loop (vast majority of cases) - compute size once.

Same loop, version with iterators:

 100A5DC58 0F B6 06         movzx       eax,byte ptr [esi]
 200A5DC5B 50               push        eax
 300A5DC5C 68 70 3D AB 00   push        0AB3D70h
 400A5DC61 FF 15 10 58 AE 00 call        dword ptr [__imp__printf (0AE5810h)]
 500A5DC67 59               pop         ecx
 600A5DC68 59               pop         ecx
 700A5DC69 83 C6 0D         add         esi,0Dh
 8for(rde::vector<Foo>::const_iterator i = v.begin(); i != v.end(); ++i)
 900A5DC6C 3B 77 04         cmp         esi,dword ptr [edi+4]
1000A5DC6F 75 E7            jne         `anonymous namespace'::TestVIter+0Ah (0A5DC58h)

..and with size only calculated once:

 1//for(std::size_t i = 0, s = v.size(); i < s; ++i)
 200C5F64E 8B 06            mov         eax,dword ptr [esi]
 300C5F650 0F B6 04 07      movzx       eax,byte ptr [edi+eax]
 400C5F654 50               push        eax
 500C5F655 68 6C 3D CA 00   push        0CA3D6Ch
 600C5F65A FF 15 10 58 CD 00 call        dword ptr [__imp__printf (0CD5810h)]
 700C5F660 59               pop         ecx
 800C5F661 83 C7 0D         add         edi,0Dh
 900C5F664 4B               dec         ebx
1000C5F665 59               pop         ecx
1100C5F666 75 E6            jne         `anonymous namespace'::TestV+1Ah (0C5F64Eh)

Finally, if you find yourself in a situation where you have to use indices and you know array size might change, consider making element’s size a power of two, so that compiler can replace idiv instruction with shift.

[Digression: GCC is actually smart enough to replace division by constant with multiplication by a “magic number” (see Hacker’s Delight). See GCC Explorer for details (I still can’t get over how awesome this tool is).]

A broader lesson from all that is we should try to avoid converting between indices & pointers if only possible. I learned it the hard way when optimizing my insert() method. Here’s the first, naive implementation (I’ll assume we’re inserting a POD type):

 1iterator insert(iterator it, const T& val)
 2{
 3    const size_type index = (size_type)(it - m_begin);
 4    const size_type prevSize = size();
 5    const size_type toMove = prevSize - index;
 6    if (m_end == m_capacityEnd)
 7    {
 8        grow();
 9        it = m_begin + index;
10    }
11    rde::construct(m_begin + prevSize);
12
13    if (toMove > 0)
14    {
15        //rde::internal::move_n(it, toMove, it + 1, int_to_type<has_trivial_copy<T>::value>());
16        // Assuming PODs, this ^^^^ will result in call to:
17        memmove(result, first, n * sizeof(T));
18    }
19    *it = val;
20    ++m_end;
21    return it;
22}

It results in this monstrosity:

 10130F4FC 8B 45 08         mov         eax,dword ptr [ebp+8]
 20130F4FF 53               push        ebx
 30130F500 56               push        esi
 40130F501 8B D9            mov         ebx,ecx
 50130F503 8B 33            mov         esi,dword ptr [ebx]
 60130F505 2B C6            sub         eax,esi
 70130F507 57               push        edi
 80130F508 99               cdq
 90130F509 6A 24            push        24h
100130F50B 59               pop         ecx
110130F50C F7 F9            idiv        eax,ecx (1)
12const size_type prevSize = size();
130130F50E 8B 4B 04         mov         ecx,dword ptr [ebx+4]
140130F511 6A 24            push        24h
150130F513 8B F8            mov         edi,eax
160130F515 8B C1            mov         eax,ecx
170130F517 2B C6            sub         eax,esi
180130F519 99               cdq
190130F51A 5E               pop         esi
200130F51B F7 FE            idiv        eax,esi (2)
210130F51D 8B F0            mov         esi,eax
22const size_type toMove = prevSize - index;
230130F51F 2B F7            sub         esi,edi
24if (m_end == m_capacityEnd)
250130F521 3B 4B 08         cmp         ecx,dword ptr [ebx+8]
260130F524 75 0F            jne         rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+3Ch (130F535h)
27{
28grow();
290130F526 8B CB            mov         ecx,ebx
300130F528 E8 9E 40 FC FF   call        rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::grow (12D35CBh)
31it = m_begin + index;
320130F52D 6B FF 24         imul        edi,edi,24h (1m*)
330130F530 03 3B            add         edi,dword ptr [ebx]
340130F532 89 7D 08         mov         dword ptr [it],edi
35}
36rde::construct(m_begin + prevSize);
37
38if (toMove > 0)
390130F535 85 F6            test        esi,esi
400130F537 7E 15            jle         rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+55h (130F54Eh)
41{
42rde::internal::move_n(it, toMove, it + 1, int_to_type<has_trivial_copy<T>::value>());
430130F539 8B 45 08         mov         eax,dword ptr [it]
440130F53C 6B F6 24         imul        esi,esi,24h (1/2m)
450130F53F 56               push        esi
460130F540 50               push        eax
470130F541 83 C0 24         add         eax,24h
480130F544 50               push        eax
490130F545 FF 15 C0 57 36 01 call        dword ptr [__imp__memmove (13657C0h)]
500130F54B 83 C4 0C         add         esp,0Ch
51}
52*it = val;
530130F54E 8B 75 0C         mov         esi,dword ptr [val]
540130F551 8B 7D 08         mov         edi,dword ptr [it]
55return it;
560130F554 8B 45 08         mov         eax,dword ptr [it]
570130F557 6A 09            push        9
580130F559 59               pop         ecx
590130F55A F3 A5            rep movs    dword ptr es:[edi],dword ptr [esi]
600130F55C 83 43 04 24      add         dword ptr [ebx+4],24h

2 divs, 2 muls in the worst case (1 on average)… Some of these conversions are hard to avoid, but in some cases we convert back & forth (e.g. we first divide to get ‘toMove’, but later need to obtain # of bytes, so have to multiply it again).

Here’s the same function (POD case only, branch resolved during compilation), but this time we’re actively trying to avoid conversions:

 1if (m_end == m_capacityEnd)
 2{
 3    const size_type index = (size_type)(it - m_begin);
 4    grow();
 5    it = m_begin + index;
 6}
 7rde::construct(m_end);
 8
 9if (m_end > it) // only showing POD case (resolved at compile time)
10{
11    const size_t n = reinterpret_cast<uintptr_t>(m_end) - reinterpret_cast<uintptr_t>(it);
12    Sys::MemMove(it + 1, it, n);
13}
14*it = val;
15++m_end;
16return it;
17}
Assembly:
 10086F434 8B D9            mov         ebx,ecx
 2if (m_end == m_capacityEnd)
 30086F436 8B 43 04         mov         eax,dword ptr [ebx+4]
 40086F439 56               push        esi
 50086F43A 57               push        edi
 60086F43B 3B 43 08         cmp         eax,dword ptr [ebx+8]
 70086F43E 75 20            jne         rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+30h (86F460h)
 8{
 9const size_type index = (size_type)(it - m_begin);
100086F440 8B 45 08         mov         eax,dword ptr [it]
110086F443 2B 03            sub         eax,dword ptr [ebx]
120086F445 6A 24            push        24h
130086F447 99               cdq
140086F448 59               pop         ecx
150086F449 F7 F9            idiv        eax,ecx
16grow();
170086F44B 8B CB            mov         ecx,ebx
180086F44D 8B F0            mov         esi,eax
190086F44F E8 77 41 FC FF   call        rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::grow (8335CBh)
20it = m_begin + index;
210086F454 6B F6 24         imul        esi,esi,24h
220086F457 03 33            add         esi,dword ptr [ebx]
230086F459 8B FE            mov         edi,esi
240086F45B 89 7D 08         mov         dword ptr [it],edi
250086F45E EB 03            jmp         rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+33h (86F463h)
260086F460 8B 7D 08         mov         edi,dword ptr [it]
27}
28rde::construct(m_end);
29
30if (m_end > it) // only showing POD case (resolved at compile time)
310086F463 8B 43 04         mov         eax,dword ptr [ebx+4]
320086F466 3B C7            cmp         eax,edi
330086F468 76 11            jbe         rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+4Bh (86F47Bh)
34{
35const size_t n = reinterpret_cast<uintptr_t>(m_end) - reinterpret_cast<uintptr_t>(it);
360086F46A 2B C7            sub         eax,edi
37Sys::MemMove(it + 1, it, n);
380086F46C 50               push        eax
390086F46D 8D 47 24         lea         eax,[edi+24h]
400086F470 57               push        edi
410086F471 50               push        eax
420086F472 FF 15 C0 57 8C 00 call        dword ptr [__imp__memmove (8C57C0h)]
430086F478 83 C4 0C         add         esp,0Ch
44}
45*it = val;
460086F47B 8B 75 0C         mov         esi,dword ptr [val]
47return it;
480086F47E 8B 45 08         mov         eax,dword ptr [it]
490086F481 6A 09            push        9
500086F483 59               pop         ecx
510086F484 F3 A5            rep movs    dword ptr es:[edi],dword ptr [esi]
520086F486 83 43 04 24      add         dword ptr [ebx+4],24h

Down to 1 div/1 mul in the worst case, no divisions/multiplication in the typical case.

Obviously, these are all micro-optimizations, hard to expect spectacular gains. Still, it’s good to keep it in mind and remember how tiny changes in high level code can result in vastly different assembly being generated.

Old comments

Peng 2013-04-19 15:53:39

Hi,
I cannot understand one of copy_n methods, could u shed some light on it?

template
void copy_n(const T* first, size_t n, T* result, int_to_type)
{
const T* last = first + n;
//while (first != last)
// *result++ = *first++;
switch (n & 0x3)
{
case 0:
while (first != last)
{
*result++ = *first++;
case 3: *result++ = *first++;
case 2: *result++ = *first++;
case 1: *result++ = *first++;
}
}
}

Peng 2013-04-19 15:57:26

To be precise, i cannot understand why there’s a while loop in case 0 that includes case 1/2/3.

admin 2013-04-19 17:01:59

That’s Duff’s Device. I described it in more detail here: http://msinilo.pl/blog/?p=504 .

More Reading
Newer// MemTracer 64