Anatomy of Duff’s Device
October 27, 2009 – 6:04 pmDuff’s Device is one of the most brilliant exploits of C syntax. It’s used to unroll loops and save some cycles spent on loop ‘maintenance’. Let’s take a look at typical fill_n function:
template<typename T> RDE_FORCEINLINE
void fill_n(T* first, size_t n, const T& val)
{
for (size_t i = 0; i < n; ++i)
first[i] = val;
}
Now, assuming that element is not too expensive to copy, execution time of this function can be heavily influenced just by loop ‘maintenance’ I mentioned (counter modifications + jumps). Let’s see how does it look in assembly (12 bytes element, no funky copy constructors, so cheap to copy):
; 621 : for (int i = 0; i < n; ++i) mov ecx, DWORD PTR _n$[ebp] test ecx, ecx jle SHORT $LN1@NormalTest mov eax, DWORD PTR _first$[ebp] add eax, 8 $LL3@NormalTest: ; 622 : first[i] = val; mov dl, BYTE PTR _val$[ebp+4] movss xmm0, DWORD PTR _val$[ebp] mov BYTE PTR [eax-4], dl mov edx, DWORD PTR _val$[ebp+8] movss DWORD PTR [eax-8], xmm0 mov DWORD PTR [eax], edx add eax, 12 ; 0000000cH dec ecx jne SHORT $LL3@NormalTest $LN1@NormalTest:
Simple enough. Apart from the job that really has to be done every loop (copying stuff) we’ve 3 additional instructions per loop (add eax, 12/dec ecx + jump). Could this overhead be reduced somehow?
Duff’s Device has been introduced around 1983, back when every instruction mattered. Here is the original post by Tom Duff. Almost 30 years have passed, do such little tricks still matter nowadays? Let’s try… Here’s my version of Duff’s Device, with 4-x loop unrolling:
T* last = first + n;
switch (n & 0x3)
{
case 0:
while (first != last)
{
*first = val; ++first;
case 3: *first = val; ++first;
case 2: *first = val; ++first;
case 1: *first = val; ++first;
}
}
It can be a little bit confusing if you see it for the first time (..or second, or third, starts to make sense around 10th). Should be easier, if you ever programmed in x86 assembler. Basic idea behind Duff’s Device is almost the same as behind copying non-multiples of 4 bytes with movsd & movsb. First, you’d calculate the number of 4-byte packages to copy with movsd (n >> 2), then copy the rest (n & 0×3) with movsb. What makes things a little bit more complicated here, is the way DD exploits switch construction. We calculate number of “leftover” items to copy (3/2/1 or 0), copy them first and then proceed with 4 elements per loop. Let’s see generated assembly:
mov edx, DWORD PTR _n$[ebp] mov eax, DWORD PTR _first$[ebp] ; 627 : switch (n & 0x3) movss xmm0, DWORD PTR _val$[ebp] mov ecx, edx imul ecx, 12 ; 0000000cH push ebx mov bl, BYTE PTR _val$[ebp+4] and edx, 3 add ecx, eax sub edx, 0 push esi mov esi, DWORD PTR _val$[ebp+8] je SHORT $LN5@DuffsDevic dec edx je SHORT $LN1@DuffsDevic dec edx je SHORT $LN2@DuffsDevic dec edx je SHORT $LN3@DuffsDevic $LN4@DuffsDevic: pop esi pop ebx pop ebp ret 0 $LN5@DuffsDevic: ; 629 : case 0: ; 630 : while (first != last) cmp eax, ecx je SHORT $LN4@DuffsDevic ; 632 : *first = val; ++first; movss DWORD PTR [eax], xmm0 mov BYTE PTR [eax+4], bl mov DWORD PTR [eax+8], esi add eax, 12 ; 0000000cH $LN3@DuffsDevic: ; 633 : case 3: *first = val; ++first; movss DWORD PTR [eax], xmm0 mov BYTE PTR [eax+4], bl mov DWORD PTR [eax+8], esi add eax, 12 ; 0000000cH $LN2@DuffsDevic: ; 634 : case 2: *first = val; ++first; movss DWORD PTR [eax], xmm0 mov BYTE PTR [eax+4], bl mov DWORD PTR [eax+8], esi add eax, 12 ; 0000000cH $LN1@DuffsDevic: ; 635 : case 1: *first = val; ++first; movss DWORD PTR [eax], xmm0 mov BYTE PTR [eax+4], bl mov DWORD PTR [eax+8], esi add eax, 12 ; 0000000cH jmp SHORT $LN5@DuffsDevic
It’s actually easier to see what’s going on here. For 8x unrolled loop MSVC will use jump table instead of 8 comparisons at the start.
OK, so it’s a funky way of unrolling loops, but does it make sense to use it these days? I performed 50 loops of copying of 200002 (200k + 2) elements via ‘normal’ loop and Duff’s Device on my 2.2GHz laptop and got the following results:
- standard loop: 70ms,
- Duff’s Device (4x): 57ms,
- hand unrolled loop: 54ms
(Would be interesting to see this on PPC)
Obviously, the most important factor is: how expensive is the operation we’re performing per loop iteration. The more expensive it is – the smaller gains from unrolling, as overall cost will be dominated by loop body. I’d say that nowadays DD is more of a curiosity than something you should be using in production code, but it’s still good to know what kind of tricks you can pull off in pure ANSI C (+ it gives you better understanding of switch/case statements).










5 Responses to “Anatomy of Duff’s Device”
I’ve done some research and MSVC seems to be very sneaky about unrolling loops. As far as I understand generated ASM code (/Ox /arch:SSE2) the routine for fill(data, n, default)
is to copy default to data[0], and then perform a rep movsd using overlapping memory block (so that data[1]=data[0], data[2]=data[1] and so on). Shouldn’t it be faster in most cases?
By Tomasz Dąbrowski on Oct 28, 2009
I recall someone mentioning Duff’s Device on the 360 forums and it being discouraged because the switch negatively interacted with the branch prediction and caused a flush of the CPU.
dba
By Dan Amerson on Oct 28, 2009
Duff’s Device should strictly be a historical curiosity at this point. It’s a bad way to unroll a loop.
The problem is if you’re using an optimizing compiler; the unrolled loop interior isn’t a basic block with Duff’s Device, so the compiler can’t optimize it aggressively–all those internal branch targets prevent it from doing anything more clever than just four copies of the loop. (You can’t see that in this case because there isn’t much more clever stuff for it to do.)
So all you save is the branch overhead, you don’t save potential other savings.
Whether you should use a hand-unrolled loop (and a separate cleanup with a switch() or loop) is a separate issue.
By Sean Barrett on Oct 28, 2009
Tomasz: that’s true and that’s also why I deliberately made the test so that it can’t use movsd (as loop operation doesn’t have to map that easily to assembly instruction in all cases).
By admin on Oct 28, 2009