Choices & consequences

Spent a little bit of time tweaking the RDE vector class again. As I already mentioned, the container itself is not terribly fascinating; there are not too many choices here. There’s another battle going on at the lower level, though, and it’s interesting to see how an innocent call like size() can be the cause of a slowdown. Typically, a vector class has 3 properties it needs to keep track of:

  • buffer properties (begin pointer and capacity),

  • number of stored elements

We can safely ignore the buffer begin property - it has to be a pointer. Capacity and the number of stored elements are more interesting, though. We can store each of them either as an integer or as another pointer marking the end of a range. What are the consequences? In the first case, all operations requiring the “end” pointer become more costly (we don’t know this pointer offhand, we need to add our integer to the begin pointer). If we decide to go with the second solution – size() and/or capacity() have to convert a pair of pointers into a number of elements. (A third approach would be to store both, but this means additional bookkeeping, so it’s rarely used.) Most implementations I’ve seen use the latter approach (pointers), as we tend to operate on the array “end” quite often. The consequence is, as mentioned, that size() now requires converting a pointer difference to an element count, which usually means an integer division.
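Side by side, the two layouts look roughly like this. This is only a sketch (the member names in the integer variant are made up for illustration), though the pointer variant deliberately reuses the m_begin/m_end/m_capacityEnd names that show up in the insert() snippets later in this post:

#include <cstddef>

// Variant A: begin pointer + integer counters.
// size()/capacity() are free, end() costs an add (scaled by sizeof(T)).
template<typename T>
struct VectorWithCounts
{
    T*          m_begin;
    std::size_t m_size;
    std::size_t m_capacity;

    std::size_t size() const     { return m_size; }
    std::size_t capacity() const { return m_capacity; }
    const T*    end() const      { return m_begin + m_size; }
};

// Variant B: three raw pointers.
// end() is free, size()/capacity() cost a pointer subtraction,
// i.e. a division by sizeof(T) (an idiv unless sizeof(T) is a power of two).
template<typename T>
struct VectorWithPointers
{
    T* m_begin;
    T* m_end;
    T* m_capacityEnd;

    std::size_t size() const     { return static_cast<std::size_t>(m_end - m_begin); }
    std::size_t capacity() const { return static_cast<std::size_t>(m_capacityEnd - m_begin); }
    const T*    end() const      { return m_end; }
};

Consider this seemingly innocent code: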

struct Foo
{
    char ch[13];
};
void Test(const rde::vector<Foo>& v)
{
    for(std::size_t i = 0; i < v.size(); ++i)
    {
        printf("%d\n", v[i].ch[0]);
    }
}

Here’s the generated code (inner loop only):

00DEE420 0F B6 04 0F      movzx       eax,byte ptr [edi+ecx]
00DEE424 50               push        eax
00DEE425 68 6C 3D E1 00   push        0E13D6Ch
00DEE42A FF 15 10 58 E4 00 call        dword ptr [__imp__printf (0E45810h)]
00DEE430 8B 46 04         mov         eax,dword ptr [esi+4]
00DEE433 59               pop         ecx
00DEE434 59               pop         ecx
00DEE435 8B 0E            mov         ecx,dword ptr [esi]
00DEE437 2B C1            sub         eax,ecx
00DEE439 6A 0D            push        0Dh
00DEE43B 99               cdq
00DEE43C 5D               pop         ebp
00DEE43D F7 FD            idiv        eax,ebp ***
00DEE43F 43               inc         ebx
00DEE440 83 C7 0D         add         edi,0Dh
00DEE443 3B D8            cmp         ebx,eax
00DEE445 72 D9            jb          `anonymous namespace'::Test+1Dh (0DEE420h)

Yuck! To be fair, the compiler is sometimes smart enough to recognize that size() has no side effects and that the vector cannot be modified during iteration, so it moves this calculation outside the loop. It couldn’t do it in this case, as the vector is not local. Observations:

  • for iteration, prefer iterators (doh) to indices (especially if they’re raw pointers underneath, as they should be… even better on platforms with load/store+update instructions),

  • if you know the array is not modified in the loop (the vast majority of cases), compute the size once. Both variants are shown at the source level right below.

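For reference, here’s what those two loop shapes look like in source form. This is a quick sketch based on the loop headers and symbol names visible in the listings below; it assumes rde::vector’s const_iterator is a plain pointer and reuses the Foo struct from the earlier example:

// Iterator version: the end pointer is read directly, no division needed.
void TestVIter(const rde::vector<Foo>& v)
{
    for (rde::vector<Foo>::const_iterator i = v.begin(); i != v.end(); ++i)
        printf("%d\n", i->ch[0]);
}

// Index version with the size computed once, outside the loop.
void TestV(const rde::vector<Foo>& v)
{
    for (std::size_t i = 0, s = v.size(); i < s; ++i)
        printf("%d\n", v[i].ch[0]);
}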
Same loop, version with iterators:

00A5DC58 0F B6 06         movzx       eax,byte ptr [esi]
00A5DC5B 50               push        eax
00A5DC5C 68 70 3D AB 00   push        0AB3D70h
00A5DC61 FF 15 10 58 AE 00 call        dword ptr [__imp__printf (0AE5810h)]
00A5DC67 59               pop         ecx
00A5DC68 59               pop         ecx
00A5DC69 83 C6 0D         add         esi,0Dh
for(rde::vector<Foo>::const_iterator i = v.begin(); i != v.end(); ++i)
00A5DC6C 3B 77 04         cmp         esi,dword ptr [edi+4]
00A5DC6F 75 E7            jne         `anonymous namespace'::TestVIter+0Ah (0A5DC58h)

…and with the size calculated only once:

//for(std::size_t i = 0, s = v.size(); i < s; ++i)
00C5F64E 8B 06            mov         eax,dword ptr [esi]
00C5F650 0F B6 04 07      movzx       eax,byte ptr [edi+eax]
00C5F654 50               push        eax
00C5F655 68 6C 3D CA 00   push        0CA3D6Ch
00C5F65A FF 15 10 58 CD 00 call        dword ptr [__imp__printf (0CD5810h)]
00C5F660 59               pop         ecx
00C5F661 83 C7 0D         add         edi,0Dh
00C5F664 4B               dec         ebx
00C5F665 59               pop         ecx
00C5F666 75 E6            jne         `anonymous namespace'::TestV+1Ah (0C5F64Eh)

Finally, if you find yourself in a situation where you have to use indices and you know the array size might change, consider making the element’s size a power of two, so that the compiler can replace the idiv instruction with a shift.

[Digression: GCC is actually smart enough to replace division by a constant with multiplication by a “magic number” (see Hacker’s Delight). See GCC Explorer for details (I still can’t get over how awesome this tool is).]
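For the curious, here’s roughly what that transformation boils down to for our sizeof(Foo) == 13 case. The magic constant and shift below follow the usual unsigned-division-by-13 recipe from Hacker’s Delight; treat the snippet as an illustration, not actual compiler output:

#include <cassert>
#include <cstdint>
#include <cstdio>

// n / 13 without a division: multiply by a fixed-point reciprocal
// and take the top bits of the 64-bit product.
static uint32_t DivBy13(uint32_t n)
{
    const uint64_t magic = 0x4EC4EC4Fu; // roughly 2^34 / 13, rounded up
    return static_cast<uint32_t>((n * magic) >> 34);
}

int main()
{
    // Spot-check the trick against the real division.
    for (uint32_t n = 0; n < 1000000; ++n)
        assert(DivBy13(n) == n / 13);
    printf("ok\n");
    return 0;
}

The multiply + shift pair is typically much cheaper than idiv, which is why compilers like to do this whenever the divisor is a compile-time constant.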

A broader lesson from all this is that we should try to avoid converting between indices & pointers whenever possible. I learned it the hard way when optimizing my insert() method. Here’s the first, naive implementation (I’ll assume we’re inserting a POD type):

iterator insert(iterator it, const T& val)
{
    const size_type index = (size_type)(it - m_begin);
    const size_type prevSize = size();
    const size_type toMove = prevSize - index;
    if (m_end == m_capacityEnd)
    {
        grow();
        it = m_begin + index;
    }
    rde::construct(m_begin + prevSize);

    if (toMove > 0)
    {
        rde::internal::move_n(it, toMove, it + 1, int_to_type<has_trivial_copy<T>::value>());
        // Assuming PODs, this ^^^^ will result in a call to:
        //   memmove(result, first, n * sizeof(T));
    }
    *it = val;
    ++m_end;
    return it;
}

It results in this monstrosity:

0130F4FC 8B 45 08         mov         eax,dword ptr [ebp+8]
0130F4FF 53               push        ebx
0130F500 56               push        esi
0130F501 8B D9            mov         ebx,ecx
0130F503 8B 33            mov         esi,dword ptr [ebx]
0130F505 2B C6            sub         eax,esi
0130F507 57               push        edi
0130F508 99               cdq
0130F509 6A 24            push        24h
0130F50B 59               pop         ecx
0130F50C F7 F9            idiv        eax,ecx (1)
const size_type prevSize = size();
0130F50E 8B 4B 04         mov         ecx,dword ptr [ebx+4]
0130F511 6A 24            push        24h
0130F513 8B F8            mov         edi,eax
0130F515 8B C1            mov         eax,ecx
0130F517 2B C6            sub         eax,esi
0130F519 99               cdq
0130F51A 5E               pop         esi
0130F51B F7 FE            idiv        eax,esi (2)
0130F51D 8B F0            mov         esi,eax
const size_type toMove = prevSize - index;
0130F51F 2B F7            sub         esi,edi
if (m_end == m_capacityEnd)
0130F521 3B 4B 08         cmp         ecx,dword ptr [ebx+8]
0130F524 75 0F            jne         rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+3Ch (130F535h)
{
grow();
0130F526 8B CB            mov         ecx,ebx
0130F528 E8 9E 40 FC FF   call        rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::grow (12D35CBh)
it = m_begin + index;
0130F52D 6B FF 24         imul        edi,edi,24h (1m*)
0130F530 03 3B            add         edi,dword ptr [ebx]
0130F532 89 7D 08         mov         dword ptr [it],edi
}
rde::construct(m_begin + prevSize);

if (toMove > 0)
0130F535 85 F6            test        esi,esi
0130F537 7E 15            jle         rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+55h (130F54Eh)
{
rde::internal::move_n(it, toMove, it + 1, int_to_type<has_trivial_copy<T>::value>());
0130F539 8B 45 08         mov         eax,dword ptr [it]
0130F53C 6B F6 24         imul        esi,esi,24h (1/2m)
0130F53F 56               push        esi
0130F540 50               push        eax
0130F541 83 C0 24         add         eax,24h
0130F544 50               push        eax
0130F545 FF 15 C0 57 36 01 call        dword ptr [__imp__memmove (13657C0h)]
0130F54B 83 C4 0C         add         esp,0Ch
}
*it = val;
0130F54E 8B 75 0C         mov         esi,dword ptr [val]
0130F551 8B 7D 08         mov         edi,dword ptr [it]
return it;
0130F554 8B 45 08         mov         eax,dword ptr [it]
0130F557 6A 09            push        9
0130F559 59               pop         ecx
0130F55A F3 A5            rep movs    dword ptr es:[edi],dword ptr [esi]
0130F55C 83 43 04 24      add         dword ptr [ebx+4],24h

2 divs, 2 muls in the worst case (1 on average)… Some of these conversions are hard to avoid, but in some cases we convert back & forth (e.g. we first divide to get ‘toMove’, but later need the number of bytes, so we have to multiply it again).

Here’s the same function (POD case only, branch resolved during compilation), but this time we’re actively trying to avoid conversions:

iterator insert(iterator it, const T& val)
{
    if (m_end == m_capacityEnd)
    {
        const size_type index = (size_type)(it - m_begin);
        grow();
        it = m_begin + index;
    }
    rde::construct(m_end);

    if (m_end > it) // only showing POD case (resolved at compile time)
    {
        const size_t n = reinterpret_cast<uintptr_t>(m_end) - reinterpret_cast<uintptr_t>(it);
        Sys::MemMove(it + 1, it, n);
    }
    *it = val;
    ++m_end;
    return it;
}

Assembly:
0086F434 8B D9            mov         ebx,ecx
if (m_end == m_capacityEnd)
0086F436 8B 43 04         mov         eax,dword ptr [ebx+4]
0086F439 56               push        esi
0086F43A 57               push        edi
0086F43B 3B 43 08         cmp         eax,dword ptr [ebx+8]
0086F43E 75 20            jne         rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+30h (86F460h)
{
const size_type index = (size_type)(it - m_begin);
0086F440 8B 45 08         mov         eax,dword ptr [it]
0086F443 2B 03            sub         eax,dword ptr [ebx]
0086F445 6A 24            push        24h
0086F447 99               cdq
0086F448 59               pop         ecx
0086F449 F7 F9            idiv        eax,ecx
grow();
0086F44B 8B CB            mov         ecx,ebx
0086F44D 8B F0            mov         esi,eax
0086F44F E8 77 41 FC FF   call        rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::grow (8335CBh)
it = m_begin + index;
0086F454 6B F6 24         imul        esi,esi,24h
0086F457 03 33            add         esi,dword ptr [ebx]
0086F459 8B FE            mov         edi,esi
0086F45B 89 7D 08         mov         dword ptr [it],edi
0086F45E EB 03            jmp         rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+33h (86F463h)
0086F460 8B 7D 08         mov         edi,dword ptr [it]
}
rde::construct(m_end);

if (m_end > it) // only showing POD case (resolved at compile time)
0086F463 8B 43 04         mov         eax,dword ptr [ebx+4]
0086F466 3B C7            cmp         eax,edi
0086F468 76 11            jbe         rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+4Bh (86F47Bh)
{
const size_t n = reinterpret_cast<uintptr_t>(m_end) - reinterpret_cast<uintptr_t>(it);
0086F46A 2B C7            sub         eax,edi
Sys::MemMove(it + 1, it, n);
0086F46C 50               push        eax
0086F46D 8D 47 24         lea         eax,[edi+24h]
0086F470 57               push        edi
0086F471 50               push        eax
0086F472 FF 15 C0 57 8C 00 call        dword ptr [__imp__memmove (8C57C0h)]
0086F478 83 C4 0C         add         esp,0Ch
}
*it = val;
0086F47B 8B 75 0C         mov         esi,dword ptr [val]
return it;
0086F47E 8B 45 08         mov         eax,dword ptr [it]
0086F481 6A 09            push        9
0086F483 59               pop         ecx
0086F484 F3 A5            rep movs    dword ptr es:[edi],dword ptr [esi]
0086F486 83 43 04 24      add         dword ptr [ebx+4],24h

Down to 1 div/1 mul in the worst case, and no divisions/multiplications in the typical case.

Obviously, these are all micro-optimizations, so it’s hard to expect spectacular gains. Still, it’s good to keep them in mind and remember how tiny changes in high-level code can result in vastly different assembly being generated.

Old comments

Peng 2013-04-19 15:53:39

Hi,
I cannot understand one of copy_n methods, could u shed some light on it?

template<typename T>
void copy_n(const T* first, size_t n, T* result, int_to_type<false>)
{
    const T* last = first + n;
    //while (first != last)
    //    *result++ = *first++;
    switch (n & 0x3)
    {
    case 0:
        while (first != last)
        {
            *result++ = *first++;
    case 3: *result++ = *first++;
    case 2: *result++ = *first++;
    case 1: *result++ = *first++;
        }
    }
}

Peng 2013-04-19 15:57:26

To be precise, i cannot understand why there’s a while loop in case 0 that includes case 1/2/3.

admin 2013-04-19 17:01:59

That’s Duff’s Device. I described it in more detail here: http://msinilo.pl/blog/?p=504 .
