Choices & consequences
19/Apr 2013
Spent a little bit time tweaking RDE vector class again. As I already mentioned, the container itself is not terribly fascinating, there are not too many choices here. There’s another battle going on the lower level though, it’s interesting to see how an innocent instruction like_ size()_ can be a cause of a slowdown. Typically, a vector class has 3 properties it needs to keep track of:
buffer properties (pointer and size),
number of stored elements
We can safely ignore buffer begin property - it has to be a pointer. Capacity and number of stored elements are more interesting, though. We can either store them using an integer number or as another pointer marking the end of range. What are the consequences? In the first case all the operations requiring “end” pointer will be more costly (we don’t know this pointer offhand, need to add our integer to the begin pointer). If we decide to go with the second solution – size() and/or capacity() methods will have to convert a pair of pointers into number of elements. (Third approach would be to store both, but this means additional bookkeeping, so it’s rarely used). Most implementations I’ve seen use the latter approach (pointers), as we tend to operate on array “end” quite often. The consequence is (as mentioned) – size() method now requires conversion from pointer to integer, which usually means an integer division. Consider this seemingly innocent code:
1struct Foo
2{
3 char ch[13];
4};
5void Test(const rde::vector<Foo>& v)
6{
7 for(std::size_t i = 0; i < v.size(); ++i)
8 {
9 printf("%d\n", v[i].ch[0]);
10 }
11}
Here’s generated code (inner loop only):
100DEE420 0F B6 04 0F movzx eax,byte ptr [edi+ecx]
200DEE424 50 push eax
300DEE425 68 6C 3D E1 00 push 0E13D6Ch
400DEE42A FF 15 10 58 E4 00 call dword ptr [__imp__printf (0E45810h)]
500DEE430 8B 46 04 mov eax,dword ptr [esi+4]
600DEE433 59 pop ecx
700DEE434 59 pop ecx
800DEE435 8B 0E mov ecx,dword ptr [esi]
900DEE437 2B C1 sub eax,ecx
1000DEE439 6A 0D push 0Dh
1100DEE43B 99 cdq
1200DEE43C 5D pop ebp
1300DEE43D F7 FD idiv eax,ebp ***
1400DEE43F 43 inc ebx
1500DEE440 83 C7 0D add edi,0Dh
1600DEE443 3B D8 cmp ebx,eax
1700DEE445 72 D9 jb `anonymous namespace'::Test+1Dh (0DEE420h)
Yuck! To be fair, compiler is sometimes smart enough to recognize that size() has no side effects and vector cannot be modified during iteration, so it moves this calculation outside the loop. Couldn’t do it in this case as vector is not local. Observations:
for iteration prefer iterators (doh) to indices (especially if they’re raw pointers underneath, as they should… Even better on platforms with load/store+update instructions).
if you know array is not modified in the loop (vast majority of cases) - compute size once.
Same loop, version with iterators:
100A5DC58 0F B6 06 movzx eax,byte ptr [esi]
200A5DC5B 50 push eax
300A5DC5C 68 70 3D AB 00 push 0AB3D70h
400A5DC61 FF 15 10 58 AE 00 call dword ptr [__imp__printf (0AE5810h)]
500A5DC67 59 pop ecx
600A5DC68 59 pop ecx
700A5DC69 83 C6 0D add esi,0Dh
8for(rde::vector<Foo>::const_iterator i = v.begin(); i != v.end(); ++i)
900A5DC6C 3B 77 04 cmp esi,dword ptr [edi+4]
1000A5DC6F 75 E7 jne `anonymous namespace'::TestVIter+0Ah (0A5DC58h)
..and with size only calculated once:
1//for(std::size_t i = 0, s = v.size(); i < s; ++i)
200C5F64E 8B 06 mov eax,dword ptr [esi]
300C5F650 0F B6 04 07 movzx eax,byte ptr [edi+eax]
400C5F654 50 push eax
500C5F655 68 6C 3D CA 00 push 0CA3D6Ch
600C5F65A FF 15 10 58 CD 00 call dword ptr [__imp__printf (0CD5810h)]
700C5F660 59 pop ecx
800C5F661 83 C7 0D add edi,0Dh
900C5F664 4B dec ebx
1000C5F665 59 pop ecx
1100C5F666 75 E6 jne `anonymous namespace'::TestV+1Ah (0C5F64Eh)
Finally, if you find yourself in a situation where you have to use indices and you know array size might change, consider making element’s size a power of two, so that compiler can replace idiv instruction with shift.
[Digression: GCC is actually smart enough to replace division by constant with multiplication by a “magic number” (see Hacker’s Delight). See GCC Explorer for details (I still can’t get over how awesome this tool is).]
A broader lesson from all that is we should try to avoid converting between indices & pointers if only possible. I learned it the hard way when optimizing my insert() method. Here’s the first, naive implementation (I’ll assume we’re inserting a POD type):
1iterator insert(iterator it, const T& val)
2{
3 const size_type index = (size_type)(it - m_begin);
4 const size_type prevSize = size();
5 const size_type toMove = prevSize - index;
6 if (m_end == m_capacityEnd)
7 {
8 grow();
9 it = m_begin + index;
10 }
11 rde::construct(m_begin + prevSize);
12
13 if (toMove > 0)
14 {
15 //rde::internal::move_n(it, toMove, it + 1, int_to_type<has_trivial_copy<T>::value>());
16 // Assuming PODs, this ^^^^ will result in call to:
17 memmove(result, first, n * sizeof(T));
18 }
19 *it = val;
20 ++m_end;
21 return it;
22}
It results in this monstrosity:
10130F4FC 8B 45 08 mov eax,dword ptr [ebp+8]
20130F4FF 53 push ebx
30130F500 56 push esi
40130F501 8B D9 mov ebx,ecx
50130F503 8B 33 mov esi,dword ptr [ebx]
60130F505 2B C6 sub eax,esi
70130F507 57 push edi
80130F508 99 cdq
90130F509 6A 24 push 24h
100130F50B 59 pop ecx
110130F50C F7 F9 idiv eax,ecx (1)
12const size_type prevSize = size();
130130F50E 8B 4B 04 mov ecx,dword ptr [ebx+4]
140130F511 6A 24 push 24h
150130F513 8B F8 mov edi,eax
160130F515 8B C1 mov eax,ecx
170130F517 2B C6 sub eax,esi
180130F519 99 cdq
190130F51A 5E pop esi
200130F51B F7 FE idiv eax,esi (2)
210130F51D 8B F0 mov esi,eax
22const size_type toMove = prevSize - index;
230130F51F 2B F7 sub esi,edi
24if (m_end == m_capacityEnd)
250130F521 3B 4B 08 cmp ecx,dword ptr [ebx+8]
260130F524 75 0F jne rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+3Ch (130F535h)
27{
28grow();
290130F526 8B CB mov ecx,ebx
300130F528 E8 9E 40 FC FF call rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::grow (12D35CBh)
31it = m_begin + index;
320130F52D 6B FF 24 imul edi,edi,24h (1m*)
330130F530 03 3B add edi,dword ptr [ebx]
340130F532 89 7D 08 mov dword ptr [it],edi
35}
36rde::construct(m_begin + prevSize);
37
38if (toMove > 0)
390130F535 85 F6 test esi,esi
400130F537 7E 15 jle rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+55h (130F54Eh)
41{
42rde::internal::move_n(it, toMove, it + 1, int_to_type<has_trivial_copy<T>::value>());
430130F539 8B 45 08 mov eax,dword ptr [it]
440130F53C 6B F6 24 imul esi,esi,24h (1/2m)
450130F53F 56 push esi
460130F540 50 push eax
470130F541 83 C0 24 add eax,24h
480130F544 50 push eax
490130F545 FF 15 C0 57 36 01 call dword ptr [__imp__memmove (13657C0h)]
500130F54B 83 C4 0C add esp,0Ch
51}
52*it = val;
530130F54E 8B 75 0C mov esi,dword ptr [val]
540130F551 8B 7D 08 mov edi,dword ptr [it]
55return it;
560130F554 8B 45 08 mov eax,dword ptr [it]
570130F557 6A 09 push 9
580130F559 59 pop ecx
590130F55A F3 A5 rep movs dword ptr es:[edi],dword ptr [esi]
600130F55C 83 43 04 24 add dword ptr [ebx+4],24h
2 divs, 2 muls in the worst case (1 on average)… Some of these conversions are hard to avoid, but in some cases we convert back & forth (e.g. we first divide to get ‘toMove’, but later need to obtain # of bytes, so have to multiply it again).
Here’s the same function (POD case only, branch resolved during compilation), but this time we’re actively trying to avoid conversions:
1if (m_end == m_capacityEnd)
2{
3 const size_type index = (size_type)(it - m_begin);
4 grow();
5 it = m_begin + index;
6}
7rde::construct(m_end);
8
9if (m_end > it) // only showing POD case (resolved at compile time)
10{
11 const size_t n = reinterpret_cast<uintptr_t>(m_end) - reinterpret_cast<uintptr_t>(it);
12 Sys::MemMove(it + 1, it, n);
13}
14*it = val;
15++m_end;
16return it;
17}
10086F434 8B D9 mov ebx,ecx
2if (m_end == m_capacityEnd)
30086F436 8B 43 04 mov eax,dword ptr [ebx+4]
40086F439 56 push esi
50086F43A 57 push edi
60086F43B 3B 43 08 cmp eax,dword ptr [ebx+8]
70086F43E 75 20 jne rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+30h (86F460h)
8{
9const size_type index = (size_type)(it - m_begin);
100086F440 8B 45 08 mov eax,dword ptr [it]
110086F443 2B 03 sub eax,dword ptr [ebx]
120086F445 6A 24 push 24h
130086F447 99 cdq
140086F448 59 pop ecx
150086F449 F7 F9 idiv eax,ecx
16grow();
170086F44B 8B CB mov ecx,ebx
180086F44D 8B F0 mov esi,eax
190086F44F E8 77 41 FC FF call rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::grow (8335CBh)
20it = m_begin + index;
210086F454 6B F6 24 imul esi,esi,24h
220086F457 03 33 add esi,dword ptr [ebx]
230086F459 8B FE mov edi,esi
240086F45B 89 7D 08 mov dword ptr [it],edi
250086F45E EB 03 jmp rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+33h (86F463h)
260086F460 8B 7D 08 mov edi,dword ptr [it]
27}
28rde::construct(m_end);
29
30if (m_end > it) // only showing POD case (resolved at compile time)
310086F463 8B 43 04 mov eax,dword ptr [ebx+4]
320086F466 3B C7 cmp eax,edi
330086F468 76 11 jbe rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+4Bh (86F47Bh)
34{
35const size_t n = reinterpret_cast<uintptr_t>(m_end) - reinterpret_cast<uintptr_t>(it);
360086F46A 2B C7 sub eax,edi
37Sys::MemMove(it + 1, it, n);
380086F46C 50 push eax
390086F46D 8D 47 24 lea eax,[edi+24h]
400086F470 57 push edi
410086F471 50 push eax
420086F472 FF 15 C0 57 8C 00 call dword ptr [__imp__memmove (8C57C0h)]
430086F478 83 C4 0C add esp,0Ch
44}
45*it = val;
460086F47B 8B 75 0C mov esi,dword ptr [val]
47return it;
480086F47E 8B 45 08 mov eax,dword ptr [it]
490086F481 6A 09 push 9
500086F483 59 pop ecx
510086F484 F3 A5 rep movs dword ptr es:[edi],dword ptr [esi]
520086F486 83 43 04 24 add dword ptr [ebx+4],24h
Down to 1 div/1 mul in the worst case, no divisions/multiplication in the typical case.
Obviously, these are all micro-optimizations, hard to expect spectacular gains. Still, it’s good to keep it in mind and remember how tiny changes in high level code can result in vastly different assembly being generated.
Old comments
Peng 2013-04-19 15:53:39
Hi,
I cannot understand one of copy_n methods, could u shed some light on it?
”
template
void copy_n(const T* first, size_t n, T* result, int_to_type)
{
const T* last = first + n;
//while (first != last)
// *result++ = *first++;
switch (n & 0x3)
{
case 0:
while (first != last)
{
*result++ = *first++;
case 3: *result++ = *first++;
case 2: *result++ = *first++;
case 1: *result++ = *first++;
}
}
}
“
Peng 2013-04-19 15:57:26
To be precise, i cannot understand why there’s a while loop in case 0 that includes case 1/2/3.
admin 2013-04-19 17:01:59
That’s Duff’s Device. I described it in more detail here: http://msinilo.pl/blog/?p=504 .