Anatomy of Duff's Device
27/Oct 2009
Duff’s Device is one of the most brilliant exploits of C syntax. It’s used to unroll loops and save some cycles spent on loop ‘maintenance’. Let’s take a look at typical fill_n function:
template RDE_FORCEINLINE
void fill_n(T* first, size_t n, const T& val)
{
for (size_t i = 0; i < n; ++i)
first[i] = val;
}
; for (int i = 0; i < n; ++i)
mov ecx, DWORD PTR _n$[ebp]
test ecx, ecx
jle SHORT $LN1@NormalTest
mov eax, DWORD PTR _first$[ebp]
add eax, 8
$LL3@NormalTest:
; first[i] = val;
mov dl, BYTE PTR _val$[ebp+4]
movss xmm0, DWORD PTR _val$[ebp]
mov BYTE PTR [eax-4], dl
mov edx, DWORD PTR _val$[ebp+8]
movss DWORD PTR [eax-8], xmm0
mov DWORD PTR [eax], edx
add eax, 12 ; 0000000cH
dec ecx
jne SHORT $LL3@NormalTest
$LN1@NormalTest:
Simple enough. Apart from the job that really has to be done every loop (copying stuff) we’ve 3 additional instructions per loop (add eax, 12/dec ecx + jump). Could this overhead be reduced somehow?
Duff’s Device has been introduced around 1983, back when every instruction mattered. Here is the original post by Tom Duff. Almost 30 years have passed, do such little tricks still matter nowadays? Let’s try… Here’s my version of Duff’s Device, with 4-x loop unrolling:
T* last = first + n;
switch (n & 0x3)
{
case 0:
while (first != last)
{
*first = val; ++first;
case 3: *first = val; ++first;
case 2: *first = val; ++first;
case 1: *first = val; ++first;
}
}
mov edx, DWORD PTR _n$[ebp]
mov eax, DWORD PTR _first$[ebp]
; 627 : switch (n & 0x3)
movss xmm0, DWORD PTR _val$[ebp]
mov ecx, edx
imul ecx, 12 ; 0000000cH
push ebx
mov bl, BYTE PTR _val$[ebp+4]
and edx, 3
add ecx, eax
sub edx, 0
push esi
mov esi, DWORD PTR _val$[ebp+8]
je SHORT $LN5@DuffsDevic
dec edx
je SHORT $LN1@DuffsDevic
dec edx
je SHORT $LN2@DuffsDevic
dec edx
je SHORT $LN3@DuffsDevic
$LN4@DuffsDevic:
pop esi
pop ebx
pop ebp
ret 0
$LN5@DuffsDevic:
; 629 : case 0:
; 630 : while (first != last)
cmp eax, ecx
je SHORT $LN4@DuffsDevic
; 632 : *first = val; ++first;
movss DWORD PTR [eax], xmm0
mov BYTE PTR [eax+4], bl
mov DWORD PTR [eax+8], esi
add eax, 12 ; 0000000cH
$LN3@DuffsDevic:
; 633 : case 3: *first = val; ++first;
movss DWORD PTR [eax], xmm0
mov BYTE PTR [eax+4], bl
mov DWORD PTR [eax+8], esi
add eax, 12 ; 0000000cH
$LN2@DuffsDevic:
; 634 : case 2: *first = val; ++first;
movss DWORD PTR [eax], xmm0
mov BYTE PTR [eax+4], bl
mov DWORD PTR [eax+8], esi
add eax, 12 ; 0000000cH
$LN1@DuffsDevic:
; 635 : case 1: *first = val; ++first;
movss DWORD PTR [eax], xmm0
mov BYTE PTR [eax+4], bl
mov DWORD PTR [eax+8], esi
add eax, 12 ; 0000000cH
jmp SHORT $LN5@DuffsDevic
It’s actually easier to see what’s going on here. For 8x unrolled loop MSVC will use jump table instead of 8 comparisons at the start.
OK, so it’s a funky way of unrolling loops, but does it make sense to use it these days? I performed 50 loops of copying of 200002 (200k + 2) elements via ‘normal’ loop and Duff’s Device on my 2.2GHz laptop and got the following results:
standard loop: 70ms,
Duff’s Device (4x): 57ms,
hand unrolled loop: 54ms
(Would be interesting to see this on PPC) Obviously, the most important factor is: how expensive is the operation we’re performing per loop iteration. The more expensive it is - the smaller gains from unrolling, as overall cost will be dominated by loop body. I’d say that nowadays DD is more of a curiosity than something you should be using in production code, but it’s still good to know what kind of tricks you can pull off in pure ANSI C (+ it gives you better understanding of switch/case statements).
Old comments
Tomasz Dabrowski 2009-10-28 09:22:07
I’ve done some research and MSVC seems to be very sneaky about unrolling loops. As far as I understand generated ASM code (/Ox /arch:SSE2) the routine for fill(data, n, default)
is to copy default to data[0], and then perform a rep movsd using overlapping memory block (so that data[1]=data[0], data[2]=data[1] and so on). Shouldn’t it be faster in most cases?
Dan Amerson 2009-10-28 13:10:15
I recall someone mentioning Duff’s Device on the 360 forums and it being discouraged because the switch negatively interacted with the branch prediction and caused a flush of the CPU.
dba
Sean Barrett 2009-10-28 20:23:34
Duff’s Device should strictly be a historical curiosity at this point. It’s a bad way to unroll a loop.
The problem is if you’re using an optimizing compiler; the unrolled loop interior isn’t a basic block with Duff’s Device, so the compiler can’t optimize it aggressively–all those internal branch targets prevent it from doing anything more clever than just four copies of the loop. (You can’t see that in this case because there isn’t much more clever stuff for it to do.)
So all you save is the branch overhead, you don’t save potential other savings.
Whether you should use a hand-unrolled loop (and a separate cleanup with a switch() or loop) is a separate issue.
admin 2009-10-28 21:22:55
Tomasz: that’s true and that’s also why I deliberately made the test so that it can’t use movsd (as loop operation doesn’t have to map that easily to assembly instruction in all cases).
ll3 2010-04-01 22:11:51
[…] its not gona happen! … Name (required) Mail (will not be published) (required) Website …Anatomy of Duff's Device | .mischief.mayhem.soap.Duff's Device is one of the most brilliant exploits of C syntax. It's used to unroll loops and save […]