Anatomy of Duff’s Device

October 27, 2009 – 6:04 pm

Duff’s Device is one of the most brilliant exploits of C syntax. It’s used to unroll loops and save some cycles spent on loop ‘maintenance’. Let’s take a look at typical fill_n function:

template<typename T> RDE_FORCEINLINE
void fill_n(T* first, size_t n, const T& val)
{
    for (size_t i = 0; i < n; ++i)
        first[i] = val;
}

Now, assuming that element is not too expensive to copy, execution time of this function can be heavily influenced just by loop ‘maintenance’ I mentioned (counter modifications + jumps). Let’s see how does it look in assembly (12 bytes element, no funky copy constructors, so cheap to copy):

; 621  :     for (int i = 0; i < n; ++i)
mov    ecx, DWORD PTR _n$[ebp]
test   ecx, ecx
jle    SHORT $LN1@NormalTest
mov    eax, DWORD PTR _first$[ebp]
add    eax, 8
$LL3@NormalTest:
; 622  :         first[i] = val;
mov    dl, BYTE PTR _val$[ebp+4]
movss  xmm0, DWORD PTR _val$[ebp]
mov    BYTE PTR [eax-4], dl
mov    edx, DWORD PTR _val$[ebp+8]
movss  DWORD PTR [eax-8], xmm0
mov    DWORD PTR [eax], edx
add    eax, 12                    ; 0000000cH
dec    ecx
jne    SHORT $LL3@NormalTest
$LN1@NormalTest:

Simple enough. Apart from the job that really has to be done every loop (copying stuff) we’ve 3 additional instructions per loop (add eax, 12/dec ecx + jump). Could this overhead be reduced somehow?

Duff’s Device has been introduced around 1983, back when every instruction mattered. Here is the original post by Tom Duff. Almost 30 years have passed, do such little tricks still matter nowadays? Let’s try… Here’s my version of Duff’s Device, with 4-x loop unrolling:

T* last = first + n;
switch (n & 0x3)
{
case 0:
    while (first != last)
    {
          *first = val; ++first;
case 3:  *first = val; ++first;
case 2:  *first = val; ++first;
case 1:  *first = val; ++first;
    }
}

It can be a little bit confusing if you see it for the first time (..or second, or third, starts to make sense around 10th). Should be easier, if you ever programmed in x86 assembler. Basic idea behind Duff’s Device is almost the same as behind copying non-multiples of 4 bytes with movsd & movsb. First, you’d calculate the number of 4-byte packages to copy with movsd (n >> 2), then copy the rest (n & 0×3) with movsb.  What makes things a little bit more complicated here, is the way DD exploits switch construction. We calculate number of “leftover” items to copy (3/2/1 or 0), copy them first and then proceed with 4 elements per loop. Let’s see generated assembly:

mov    edx, DWORD PTR _n$[ebp]
mov    eax, DWORD PTR _first$[ebp]
; 627  :     switch (n & 0x3)
movss  xmm0, DWORD PTR _val$[ebp]
mov    ecx, edx
imul   ecx, 12                    ; 0000000cH
push   ebx
mov    bl, BYTE PTR _val$[ebp+4]
and    edx, 3
add    ecx, eax
sub    edx, 0
push   esi
mov    esi, DWORD PTR _val$[ebp+8]
je     SHORT $LN5@DuffsDevic
dec    edx
je     SHORT $LN1@DuffsDevic
dec    edx
je     SHORT $LN2@DuffsDevic
dec    edx
je     SHORT $LN3@DuffsDevic
$LN4@DuffsDevic:
pop    esi
pop    ebx
pop    ebp
ret    0
$LN5@DuffsDevic:
; 629  :     case 0:
; 630  :         while (first != last)
cmp    eax, ecx
je     SHORT $LN4@DuffsDevic
; 632  :             *first = val; ++first;
movss  DWORD PTR [eax], xmm0
mov    BYTE PTR [eax+4], bl
mov    DWORD PTR [eax+8], esi
add    eax, 12                    ; 0000000cH
$LN3@DuffsDevic:
; 633  :     case 3:    *first = val; ++first;
movss  DWORD PTR [eax], xmm0
mov    BYTE PTR [eax+4], bl
mov    DWORD PTR [eax+8], esi
add    eax, 12                    ; 0000000cH
$LN2@DuffsDevic:
; 634  :     case 2:    *first = val; ++first;
movss  DWORD PTR [eax], xmm0
mov    BYTE PTR [eax+4], bl
mov    DWORD PTR [eax+8], esi
add    eax, 12                    ; 0000000cH
$LN1@DuffsDevic:
; 635  :     case 1:    *first = val; ++first;
movss  DWORD PTR [eax], xmm0
mov    BYTE PTR [eax+4], bl
mov    DWORD PTR [eax+8], esi
add    eax, 12                    ; 0000000cH
jmp    SHORT $LN5@DuffsDevic

It’s actually easier to see what’s going on here. For 8x unrolled loop MSVC will use jump table instead of 8 comparisons at the start.

OK, so it’s a funky way of unrolling loops, but does it make sense to use it these days? I performed 50 loops of copying of 200002 (200k + 2) elements via ‘normal’ loop and Duff’s Device on my 2.2GHz laptop and got the following results:

  • standard loop: 70ms,
  • Duff’s Device (4x): 57ms,
  • hand unrolled loop: 54ms

(Would be interesting to see this on PPC)
Obviously, the most important factor is: how expensive is the operation we’re performing per loop iteration. The more expensive it is – the smaller gains from unrolling, as overall cost will be dominated by loop body. I’d say that nowadays DD is more of a curiosity than something you should be using in production code, but it’s still good to know what kind of tricks you can pull off in pure ANSI C (+ it gives you better understanding of switch/case statements).

  1. 5 Responses to “Anatomy of Duff’s Device”

  2. I’ve done some research and MSVC seems to be very sneaky about unrolling loops. As far as I understand generated ASM code (/Ox /arch:SSE2) the routine for fill(data, n, default)
    is to copy default to data[0], and then perform a rep movsd using overlapping memory block (so that data[1]=data[0], data[2]=data[1] and so on). Shouldn’t it be faster in most cases?

    By Tomasz Dąbrowski on Oct 28, 2009

  3. I recall someone mentioning Duff’s Device on the 360 forums and it being discouraged because the switch negatively interacted with the branch prediction and caused a flush of the CPU.

    dba

    By Dan Amerson on Oct 28, 2009

  4. Duff’s Device should strictly be a historical curiosity at this point. It’s a bad way to unroll a loop.

    The problem is if you’re using an optimizing compiler; the unrolled loop interior isn’t a basic block with Duff’s Device, so the compiler can’t optimize it aggressively–all those internal branch targets prevent it from doing anything more clever than just four copies of the loop. (You can’t see that in this case because there isn’t much more clever stuff for it to do.)

    So all you save is the branch overhead, you don’t save potential other savings.

    Whether you should use a hand-unrolled loop (and a separate cleanup with a switch() or loop) is a separate issue.

    By Sean Barrett on Oct 28, 2009

  5. Tomasz: that’s true and that’s also why I deliberately made the test so that it can’t use movsd (as loop operation doesn’t have to map that easily to assembly instruction in all cases).

    By admin on Oct 28, 2009

  1. 1 Trackback(s)

  2. Apr 1, 2010: ll3

Post a Comment