Jump to content

Duff's device: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Codemonkey (talk | contribs)
added slashdotted template, see talk page
No edit summary
Line 20: Line 20:
}
}


Based on a widely-used algorithm used by [[assembler]]s for minimizing the number of tests and branches during a copy, it appears out of place when implemented in C. The device is indeed perfectly valid, legal C, and many compilers will optimize the switch into a jump table just as would be done in an assembler implementation. C's default fall-through in case statements has long been its most controversial single feature; Duff observed that "This code forms some sort of argument in that debate, but I'm not sure whether it's for or against."
Based on an algorithm used widely by [[assembler]]s for minimizing the number of tests and branches during a copy, '''Duff's Device''' appears out of place when implemented in C. The device is indeed perfectly valid, legal C, and many compilers will optimize the switch into a jump table just as would be done in an assembler implementation. C's default fall-through in case statements has long been its most controversial single feature; Duff observed that "This code forms some sort of argument in that debate, but I'm not sure whether it's for or against."


Why is this faster than a simple, straightforward loop? The primary increase in speed comes from [[Loop unrolling]], which reduces the number of comparisons performed during a loop. The switch..case statement is used to handle the remainder of the data not evenly divisible by the number of operations unrolled (in this example, 8 byte moves are unrolled, so the switch/case handles an extra 1-7 bytes automatically.)
Why is this faster than a simple, straightforward loop? The primary increase in speed comes from [[Loop unrolling]], which reduces the number of comparisons performed during a loop. The switch..case statement is used to handle the remainder of the data not evenly divisible by the number of operations unrolled (in this example, 8 byte moves are unrolled, so the switch/case handles an extra 1-7 bytes automatically.)

Revision as of 03:08, 7 October 2005

In computer science, Duff's Device is an optimized implementation of a serial copy that uses a technique widely applied in assembly language for loop unwinding. Its discovery is credited to Tom Duff in November of 1983, who at the time was working for Lucasfilm. It is perhaps the most dramatic use of case label fall-through in the C programming language to date. Note that Duff does not claim credit for discovering the concept of loop unrolling, just this particular expression of it in C.

While optimizing a serial copy, Duff realized that an unrolled version of his loop could be implemented by interlacing the structures of a switch and a loop:

 int n = (count + 7) / 8;      /* count > 0 assumed */
  
 switch (count % 8)
 {
 case 0:        do {  *to = *from++;
 case 7:              *to = *from++;
 case 6:              *to = *from++;
 case 5:              *to = *from++;
 case 4:              *to = *from++;
 case 3:              *to = *from++;
 case 2:              *to = *from++;
 case 1:              *to = *from++;
                    } while (--n > 0);
 }

Based on an algorithm used widely by assemblers for minimizing the number of tests and branches during a copy, Duff's Device appears out of place when implemented in C. The device is indeed perfectly valid, legal C, and many compilers will optimize the switch into a jump table just as would be done in an assembler implementation. C's default fall-through in case statements has long been its most controversial single feature; Duff observed that "This code forms some sort of argument in that debate, but I'm not sure whether it's for or against."

Why is this faster than a simple, straightforward loop? The primary increase in speed comes from Loop unrolling, which reduces the number of comparisons performed during a loop. The switch..case statement is used to handle the remainder of the data not evenly divisible by the number of operations unrolled (in this example, 8 byte moves are unrolled, so the switch/case handles an extra 1-7 bytes automatically.)

This automatic handling of the remainder may not be the best solution on all systems and compilers - in some cases two loops may actually be faster (one loop, unrolled, to do the main copy, and a second loop to handle the remainder). The problem appears to come down to the ability of the compiler to correctly optimize the device, although at least one person has suggested that it may also interfere with pipelining and branch prediction on some architectures. Therefore, when considering using this code, it may be worth running a few benchmarks to verify that it actually is the fastest code on the target architecture, at the target optimization level, with the target compiler.

This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.

Additional note: The original Device was made for copying to a register. To actually copy memory from one location to another, you must add an auto-increment to every reference to to, like so:

 *to++ = *from++;

This modified form of the Device appears as a "what does this code do?" exercise in Bjarne Stroustrup's book The C++ Programming Language, presumably because novice programmers cannot be expected to know about memory-mapped output registers. However, it is not particularly useful in this form since the standard C library already supplies a memory-copying function which is likely to be even better optimized.

Books

  • Bjarne Stroustrup: The C++ Programming Language, Third Edition, Addison-Wesley, ISBN 0-201-88954-4