Duff's device: Difference between revisions
Added note that the Device cannot be used to copy memory as it is presented. Provided the change needed to make it do that. |
→Mechanism: The flag in question is "-funstructured-switch" |
||
(452 intermediate revisions by more than 100 users not shown) | |||
Line 1: | Line 1: | ||
{{Short description|Implementation of loop unrolling in C}} |
|||
In [[computer science]], '''Duff's device''' is an [[optimization (computer science)|optimised]] [[implementation]] of a serial copy, called loop unrolling. |
|||
In the [[C (programming language)|C]] programming language, '''Duff's device''' is a way of manually implementing [[loop unrolling]] by interleaving two syntactic constructs of C: the {{mono|do}}-{{mono|while}} loop and a [[switch statement]]. Its discovery is credited to [[Tom Duff]] in November 1983, when Duff was working for [[Lucasfilm]] and used it to speed up a real-time animation program. |
|||
Loop unrolling attempts to reduce the overhead of [[conditional branching]] needed to check whether a loop is done, by executing a batch of loop bodies per iteration. To handle cases where the number of iterations is not divisible by the unrolled-loop increments, a common technique among [[assembly language]] programmers is to jump directly into the middle of the unrolled loop body to handle the remainder.<ref name="drdobbs">{{cite news |
|||
It is perhaps the most dramatic use yet seen of [[fall through]] in the [[C programming language]], and was invented by [[Tom Duff]] when he was at [[Lucasfilm]]. Whilst optimising an inner [[Program_loop#Loops|loop]] that copied data serially onto an output port, he decided to unroll it. He then realised that the unrolled version could be implemented by ''interlacing'' the structures of a switch and a loop: |
|||
| url = http://www.drdobbs.com/a-reusable-duff-device/184406208?queryText=%2522duff%2527s%2Bdevice%2522 |
|||
| title = A Reusable Duff Device |
|||
| date = August 1, 2005 | access-date = September 18, 2015 |
|||
| last = Holly | first = Ralf | newspaper = Dr. Dobb's |
|||
| publisher = [[Dr. Dobb's Journal]] |
|||
}}</ref> |
|||
Duff implemented this technique in C by using C's [[Switch statement#Fallthrough|case label fall-through]] feature to jump into the unrolled body.<ref name="lysatorduff">{{cite web |
|||
| url = https://www.lysator.liu.se/c/duffs-device.html |
|||
| title = Subject: Re: Explanation, please! |
|||
| date = August 29, 1988 | access-date = November 3, 2015 |
|||
| last = Duff | first = Tom | publisher = [[Lysator]] |
|||
}}</ref> |
|||
==Original version== |
|||
int n = (count + 7) / 8; /* count > 0 assumed */ |
|||
Duff's problem was to copy 16-bit unsigned integers ("shorts" in most C implementations) from an array into a [[memory-mapped I/O|memory-mapped output]] register, denoted in C by a [[Pointer (computer programming)|pointer]]. His original code, in [[C (programming language)|C]], looked as follows:<ref>{{Cite web|url=http://doc.cat-v.org/bell_labs/duffs_device|title=The amazing Duff's Device by Tom Duff!|website=doc.cat-v.org|access-date=2017-06-08}}</ref><ref>{{cite web|url= https://research.swtch.com/duff |title= research!rsc: On Duff's Device and Coroutines |website= research.swtch.com |first= Russ |last= Cox |date= 2008-01-30 |accessdate= 2017-01-24}}</ref> |
|||
|
|||
switch (count % 8) |
|||
{ |
|||
case 0: do { *to = *from++; |
|||
case 7: *to = *from++; |
|||
case 6: *to = *from++; |
|||
case 5: *to = *from++; |
|||
case 4: *to = *from++; |
|||
case 3: *to = *from++; |
|||
case 2: *to = *from++; |
|||
case 1: *to = *from++; |
|||
} while (--n > 0); |
|||
} |
|||
<syntaxhighlight lang="c"> |
|||
Based on a widely-used [[Assembler|assembly language]] algorithm for minimizing the number of tests and branches during a copy, it appears out of place when implemented in C. The device is indeed perfectly valid, legal C, and many compilers will optimize the switch into a jump table just as would be done in an assembler implementation. C's default fall through in case statements has long been its most controversial single feature; Duff observed that "This code forms some sort of argument in that debate, but I'm not sure whether it's for or against." |
|||
send(to, from, count) |
|||
register short *to, *from; |
|||
register count; |
|||
{ |
|||
do { /* count > 0 assumed */ |
|||
*to = *from++; |
|||
} while (--count > 0); |
|||
} |
|||
</syntaxhighlight> |
|||
This code assumes that initial {{mono|count > 0}}. Since the output location is a memory-mapped register, the pointer {{mono|to}} is not incremented as would be required for a memory-to-memory copy. |
|||
{{FOLDOC}} |
|||
If {{mono|count}} were always divisible by eight, unrolling this loop eight-fold would produce the following: |
|||
Additional note: The original Device was made for copying to a register. To actually copy memory from one location to another, you must add an auto-increment to every reference to ''to'', like so: |
|||
<syntaxhighlight lang="c"> |
|||
*to++ = *from++; |
|||
send(to, from, count) |
|||
register short *to, *from; |
|||
register count; |
|||
{ |
|||
register n = count / 8; |
|||
do { |
|||
*to = *from++; |
|||
*to = *from++; |
|||
*to = *from++; |
|||
*to = *from++; |
|||
*to = *from++; |
|||
*to = *from++; |
|||
*to = *from++; |
|||
*to = *from++; |
|||
} while (--n > 0); |
|||
} |
|||
</syntaxhighlight> |
|||
Duff realized that to handle cases where {{mono|count}} is not divisible by eight, the assembly programmer's technique of jumping into the loop body could be implemented by interlacing the structures of a switch statement and a loop, putting the switch's {{mono|case}} labels at the points of the loop body that correspond to the remainder of {{mono|count/8}}:<ref name="drdobbs" /> |
|||
==External links== |
|||
<syntaxhighlight lang="c"> |
|||
send(to, from, count) |
|||
register short *to, *from; |
|||
register count; |
|||
{ |
|||
register n = (count + 7) / 8; |
|||
switch (count % 8) { |
|||
case 0: do { *to = *from++; |
|||
case 7: *to = *from++; |
|||
case 6: *to = *from++; |
|||
case 5: *to = *from++; |
|||
case 4: *to = *from++; |
|||
case 3: *to = *from++; |
|||
case 2: *to = *from++; |
|||
case 1: *to = *from++; |
|||
} while (--n > 0); |
|||
} |
|||
} |
|||
</syntaxhighlight> |
|||
Duff's device can similarly be applied with any other size for the unrolled loop, not just eight as in the example above. |
|||
==Mechanism== |
|||
Based on an algorithm used widely by programmers coding in [[Assembly language|assembly]] for minimizing the number of tests and branches during a copy, Duff's device appears out of place when implemented in C. The device is valid C by virtue of two attributes in C: |
|||
# Relaxed specification of the {{mono|switch}} statement in the language's definition. At the time of the device's invention this was the first edition of ''[[The C Programming Language]]'' which requires only that the body of the {{mono|switch}} be a syntactically valid (compound) statement within which {{mono|case}} labels can appear prefixing any sub-statement. In conjunction with the fact that, in the absence of a {{mono|break}} statement, the flow of control will ''fall through'' from a statement controlled by one {{mono|case}} label to that controlled by the next, this means that the code specifies a succession of {{mono|count}} copies from sequential source addresses to the memory-mapped output port. |
|||
# The ability to jump into the middle of a loop in C. |
|||
This leads to what the ''[[Jargon File]]'' calls "the most dramatic use yet seen of fall through in C".{{r|Jargon}} C's default fall-through in case statements has long been one of its most controversial features; Duff himself said that "This code forms some sort of argument in that debate, but I'm not sure whether it's for or against."<ref name="Jargon">[http://www.catb.org/jargon/html/D/Duffs-device.html The Jargon File]</ref> |
|||
Although valid in C, Duff's device goes against common C guidelines, such as the [[MISRA C|MISRA guidelines]]. Some compilers (e.g. [[CompCert]]) are restricted to such guidelines and thus reject Duff's device unless specifically instructed otherwise. |
|||
==Simplified explanation== |
|||
{| class="wikitable" align="right" style="margin-left: 1.5em;" |
|||
|+ |
|||
! A functionally equivalent version<br />with <code>switch</code> and <code>while</code> disentangled |
|||
|- |
|||
| <syntaxhighlight lang="c"> |
|||
send(to, from, count) |
|||
register short *to, *from; |
|||
register count; |
|||
{ |
|||
register n = (count + 7) / 8; |
|||
switch (count % 8) { |
|||
case 0: *to = *from++; |
|||
case 7: *to = *from++; |
|||
case 6: *to = *from++; |
|||
case 5: *to = *from++; |
|||
case 4: *to = *from++; |
|||
case 3: *to = *from++; |
|||
case 2: *to = *from++; |
|||
case 1: *to = *from++; |
|||
} |
|||
while (--n > 0) { |
|||
*to = *from++; |
|||
*to = *from++; |
|||
*to = *from++; |
|||
*to = *from++; |
|||
*to = *from++; |
|||
*to = *from++; |
|||
*to = *from++; |
|||
*to = *from++; |
|||
} |
|||
} |
|||
</syntaxhighlight> |
|||
|} |
|||
The basic idea of [[loop unrolling]] is that the number of instructions executed in a loop can be reduced by reducing the number of loop tests, sometimes reducing the amount of time spent in the loop. For example, in the case of a loop with only a single instruction in the block code, the loop test will typically be performed for every iteration of the loop, that is every time the instruction is executed. If, instead, eight copies of the same instruction are placed in the loop, then the test will be performed only every eight iterations, and this may gain time by avoiding seven tests. However, this only handles a multiple of eight iterations, requiring something else to handle any [[remainder]] of iterations.<ref name="drdobbs" /> |
|||
Duff's device provides a solution by first performing the remainder of iterations, followed by iterating as many times as necessary the multiple of eight similar instructions. To determine the number of remainder iterations, the code first calculates the total number of iterations [[Modulo operation|modulo]] eight. According to this remainder, the [[Program counter|program execution]] will then ''[[Branch (computer science)|jump]]'' to a <code>case</code> statement followed by ''exactly the number of iterations needed''. Once this is done, everything is straightforward: the code continues by doing iterations of groups of eight instructions; this has become possible since the remaining number of iterations is a multiple of eight.<ref name="drdobbs" /> |
|||
Duff's device provides a compact loop unrolling by using the case keyword ''both inside and outside the loop''. This is unusual because the contents of a case statement are traditionally thought of as a block of code nested inside the case statement, and a reader would typically expect it to end before the next case statement. According to the specifications of C language, this is not necessary; indeed, case statements can appear anywhere inside the [[Switch statement|switch]] code block, and at any depth; the program execution will simply jump to the next statement, wherever it may be. |
|||
==Performance== |
|||
Many compilers will optimize the switch into a [[branch table]] just as would be done in an assembly implementation. |
|||
The primary increase in speed versus a simple, straightforward loop, comes from [[loop unwinding]] that reduces the number of performed branches, which are computationally expensive due to the need to flush{{mdashb}}and hence stall{{mdashb}}the [[instruction pipeline]]. The <code>switch</code> statement is used to handle the remainder of the data not evenly divisible by the number of operations unrolled (in this example, eight short moves are unrolled, so the <code>switch</code> handles an extra 1–7 shorts automatically). |
|||
This automatic handling of the remainder may not be the best solution on all systems and compilers{{snd}} in some cases two loops may actually be faster (one loop, unrolled, to do the main copy, and a second loop to handle the remainder). The problem appears to come down to the ability of the compiler to correctly optimize the device; it may also interfere with pipelining and [[branch prediction]] on some architectures.<ref>[http://www.l33tskillz.org/usenix2003/notes/t-09-5/ James Ralston's USENIX 2003 Journal]{{dead link|date=December 2016 |bot=InternetArchiveBot |fix-attempted=yes }}</ref> When numerous instances of Duff's device were removed from the [[XFree86]] Server in version 4.0, there was an improvement in performance and a noticeable reduction in size of the executable.<ref name="lkml-0008.2/0171">{{cite web |
|||
| url = http://lkml.indiana.edu/hypermail/linux/kernel/0008.2/0171.html |
|||
| title = Re: [PATCH] Re: Move of input drivers, some word needed from you |
|||
| date = August 22, 2000 | access-date = August 22, 2014 |
|||
| last = Tso | first = Ted | publisher = [[Linux kernel mailing list]] | website = lkml.indiana.edu |
|||
| quote = Jim Gettys has a wonderful explanation of this effect in the X server. It turns out that with branch predictions and the relative speed of CPU vs. memory changing over the past decade, loop unrolling is pretty much pointless. In fact, by eliminating all instances of Duff's Device from the [[XFree86]] 4.0 server, the server shrunk in size by _half_ _a_ _megabyte_ (!!!), and was faster to boot, because the elimination of all that excess code meant that the X server wasn't thrashing the cache lines as much. |
|||
}}</ref> Therefore, when considering this code as a [[program optimization]], it may be worth running a few [[Benchmark (computing)|benchmarks]] to verify that it actually is the fastest code on the target architecture, at the target optimization level, with the target compiler, as well as weighing the risk that the optimized code will later be used on different platforms where it is not the fastest code. |
|||
For the purpose of memory-to-memory copies (which, as mentioned above, was not the original use of Duff's device), the [[standard C library]] provides the function <code>[[memcpy]]</code>;<ref name=memcpy-cppreference>{{cite web|url=http://en.cppreference.com/enwiki/w/c/string/byte/memcpy |title=memcpy - cppreference.com |publisher=En.cppreference.com |accessdate=2014-03-06}}</ref> it will not perform worse than a memory-to-memory copy version of this code, and may contain architecture-specific optimizations that make it significantly faster.<ref name="amd2002">{{cite web |url=http://web.mit.edu/ehliu/Public/ProjectX/Meetings/AMD_block_prefetch_paper.pdf |publisher=[[Massachusetts Institute of Technology|mit.edu]] |first=Mike |last=Wall |title=Using Block Prefetch for Optimized Memory Performance |date=2002-03-19 |accessdate=2012-09-22 |archive-date=2017-08-30 |archive-url=https://web.archive.org/web/20170830073838/http://web.mit.edu/ehliu/Public/ProjectX/Meetings/AMD_block_prefetch_paper.pdf |url-status=dead }}</ref><ref name="fog2012">{{cite web|url=http://www.agner.org/optimize/optimizing_assembly.pdf|pages=100 ff|title=Optimizing subroutines in assembly language |first=Agner |last=Fog |date=2012-02-29|accessdate=2012-09-22| publisher=[[Copenhagen University College of Engineering]]}}</ref> |
|||
==See also== |
|||
{{Portal|Computer programming}} |
|||
* [[Computed GOTO]] |
|||
* [[Coroutine]]{{snd}} Duff's device can be used to implement coroutines in C/C++ (see Tatham external link) |
|||
* [[Jensen's device]] |
|||
==References== |
|||
{{Reflist|30em}} |
|||
==Further reading== |
|||
*{{cite book |last1=Kernighan |first1=Brian |authorlink1=Brian Kernighan |last2=Ritchie |first2=Dennis |authorlink2=Dennis Ritchie |date=March 1988 |title=[[The C Programming Language]] |edition=2nd |publisher=Prentice Hall |location=Englewood Cliffs, N.J. |isbn=0-13-110362-8}} |
|||
== External links == |
|||
* [http://www.lysator.liu.se/c/duffs-device.html Description and original mail by Duff at Lysator] |
* [http://www.lysator.liu.se/c/duffs-device.html Description and original mail by Duff at Lysator] |
||
* [http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html Simon Tatham's coroutines in C] utilizes the same switch/case trick |
|||
* [http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?Duff%27s+device Article at FOLDOC] |
|||
* Adam Dunkels's [https://web.archive.org/web/20051013062233/http://www.sics.se/~adam/pt/ Protothreads - Lightweight, Stackless Threads in C] also uses nested switch/case statements (see also [https://web.archive.org/web/20080828170858/http://www.eigenclass.org/hiki/threadring-with-protothreads The lightest lightweight threads, Protothreads]) |
|||
* [http://www.catb.org/~esr/jargon/html/D/Duffs-device.html Article at the Jargon File] |
|||
* [http://pigeonsnest.co.uk/stuff/pigeons-device.html Pigeon's device] is a related technique: intertwined switch/case and if/else statements |
|||
{{Prone to spam|date=November 2014}} |
|||
<!-- {{No more links}} |
|||
Please be cautious adding more external links. |
|||
Wikipedia is not a collection of links and should not be used for advertising. |
|||
Excessive or inappropriate links will be removed. |
|||
See [[Wikipedia:External links]] and [[Wikipedia:Spam]] for details. |
|||
If there are already suitable links, propose additions or replacements on |
|||
the article's talk page, or submit your link to the relevant category at |
|||
DMOZ (dmoz.org) and link there using {{Dmoz}}. --> |
|||
[[Category:C (programming language)]] |
|||
{{stub}} |
|||
[[Category:Articles with example C code]] |
|||
[[Category:Computer programming folklore]] |
|||
[[Category:Programming language folklore]] |
|||
[[Category:Source code]] |
Latest revision as of 13:17, 27 May 2024
In the C programming language, Duff's device is a way of manually implementing loop unrolling by interleaving two syntactic constructs of C: the do-while loop and a switch statement. Its discovery is credited to Tom Duff in November 1983, when Duff was working for Lucasfilm and used it to speed up a real-time animation program.
Loop unrolling attempts to reduce the overhead of conditional branching needed to check whether a loop is done, by executing a batch of loop bodies per iteration. To handle cases where the number of iterations is not divisible by the unrolled-loop increments, a common technique among assembly language programmers is to jump directly into the middle of the unrolled loop body to handle the remainder.[1] Duff implemented this technique in C by using C's case label fall-through feature to jump into the unrolled body.[2]
Original version
[edit]Duff's problem was to copy 16-bit unsigned integers ("shorts" in most C implementations) from an array into a memory-mapped output register, denoted in C by a pointer. His original code, in C, looked as follows:[3][4]
send(to, from, count)
register short *to, *from;
register count;
{
do { /* count > 0 assumed */
*to = *from++;
} while (--count > 0);
}
This code assumes that initial count > 0. Since the output location is a memory-mapped register, the pointer to is not incremented as would be required for a memory-to-memory copy.
If count were always divisible by eight, unrolling this loop eight-fold would produce the following:
send(to, from, count)
register short *to, *from;
register count;
{
register n = count / 8;
do {
*to = *from++;
*to = *from++;
*to = *from++;
*to = *from++;
*to = *from++;
*to = *from++;
*to = *from++;
*to = *from++;
} while (--n > 0);
}
Duff realized that to handle cases where count is not divisible by eight, the assembly programmer's technique of jumping into the loop body could be implemented by interlacing the structures of a switch statement and a loop, putting the switch's case labels at the points of the loop body that correspond to the remainder of count/8:[1]
send(to, from, count)
register short *to, *from;
register count;
{
register n = (count + 7) / 8;
switch (count % 8) {
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while (--n > 0);
}
}
Duff's device can similarly be applied with any other size for the unrolled loop, not just eight as in the example above.
Mechanism
[edit]Based on an algorithm used widely by programmers coding in assembly for minimizing the number of tests and branches during a copy, Duff's device appears out of place when implemented in C. The device is valid C by virtue of two attributes in C:
- Relaxed specification of the switch statement in the language's definition. At the time of the device's invention this was the first edition of The C Programming Language which requires only that the body of the switch be a syntactically valid (compound) statement within which case labels can appear prefixing any sub-statement. In conjunction with the fact that, in the absence of a break statement, the flow of control will fall through from a statement controlled by one case label to that controlled by the next, this means that the code specifies a succession of count copies from sequential source addresses to the memory-mapped output port.
- The ability to jump into the middle of a loop in C.
This leads to what the Jargon File calls "the most dramatic use yet seen of fall through in C".[5] C's default fall-through in case statements has long been one of its most controversial features; Duff himself said that "This code forms some sort of argument in that debate, but I'm not sure whether it's for or against."[5]
Although valid in C, Duff's device goes against common C guidelines, such as the MISRA guidelines. Some compilers (e.g. CompCert) are restricted to such guidelines and thus reject Duff's device unless specifically instructed otherwise.
Simplified explanation
[edit]A functionally equivalent version with switch and while disentangled
|
---|
send(to, from, count)
register short *to, *from;
register count;
{
register n = (count + 7) / 8;
switch (count % 8) {
case 0: *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
}
while (--n > 0) {
*to = *from++;
*to = *from++;
*to = *from++;
*to = *from++;
*to = *from++;
*to = *from++;
*to = *from++;
*to = *from++;
}
}
|
The basic idea of loop unrolling is that the number of instructions executed in a loop can be reduced by reducing the number of loop tests, sometimes reducing the amount of time spent in the loop. For example, in the case of a loop with only a single instruction in the block code, the loop test will typically be performed for every iteration of the loop, that is every time the instruction is executed. If, instead, eight copies of the same instruction are placed in the loop, then the test will be performed only every eight iterations, and this may gain time by avoiding seven tests. However, this only handles a multiple of eight iterations, requiring something else to handle any remainder of iterations.[1]
Duff's device provides a solution by first performing the remainder of iterations, followed by iterating as many times as necessary the multiple of eight similar instructions. To determine the number of remainder iterations, the code first calculates the total number of iterations modulo eight. According to this remainder, the program execution will then jump to a case
statement followed by exactly the number of iterations needed. Once this is done, everything is straightforward: the code continues by doing iterations of groups of eight instructions; this has become possible since the remaining number of iterations is a multiple of eight.[1]
Duff's device provides a compact loop unrolling by using the case keyword both inside and outside the loop. This is unusual because the contents of a case statement are traditionally thought of as a block of code nested inside the case statement, and a reader would typically expect it to end before the next case statement. According to the specifications of C language, this is not necessary; indeed, case statements can appear anywhere inside the switch code block, and at any depth; the program execution will simply jump to the next statement, wherever it may be.
Performance
[edit]Many compilers will optimize the switch into a branch table just as would be done in an assembly implementation.
The primary increase in speed versus a simple, straightforward loop, comes from loop unwinding that reduces the number of performed branches, which are computationally expensive due to the need to flush—and hence stall—the instruction pipeline. The switch
statement is used to handle the remainder of the data not evenly divisible by the number of operations unrolled (in this example, eight short moves are unrolled, so the switch
handles an extra 1–7 shorts automatically).
This automatic handling of the remainder may not be the best solution on all systems and compilers – in some cases two loops may actually be faster (one loop, unrolled, to do the main copy, and a second loop to handle the remainder). The problem appears to come down to the ability of the compiler to correctly optimize the device; it may also interfere with pipelining and branch prediction on some architectures.[6] When numerous instances of Duff's device were removed from the XFree86 Server in version 4.0, there was an improvement in performance and a noticeable reduction in size of the executable.[7] Therefore, when considering this code as a program optimization, it may be worth running a few benchmarks to verify that it actually is the fastest code on the target architecture, at the target optimization level, with the target compiler, as well as weighing the risk that the optimized code will later be used on different platforms where it is not the fastest code.
For the purpose of memory-to-memory copies (which, as mentioned above, was not the original use of Duff's device), the standard C library provides the function memcpy
;[8] it will not perform worse than a memory-to-memory copy version of this code, and may contain architecture-specific optimizations that make it significantly faster.[9][10]
See also
[edit]- Computed GOTO
- Coroutine – Duff's device can be used to implement coroutines in C/C++ (see Tatham external link)
- Jensen's device
References
[edit]- ^ a b c d Holly, Ralf (August 1, 2005). "A Reusable Duff Device". Dr. Dobb's. Dr. Dobb's Journal. Retrieved September 18, 2015.
- ^ Duff, Tom (August 29, 1988). "Subject: Re: Explanation, please!". Lysator. Retrieved November 3, 2015.
- ^ "The amazing Duff's Device by Tom Duff!". doc.cat-v.org. Retrieved 2017-06-08.
- ^ Cox, Russ (2008-01-30). "research!rsc: On Duff's Device and Coroutines". research.swtch.com. Retrieved 2017-01-24.
- ^ a b The Jargon File
- ^ James Ralston's USENIX 2003 Journal[permanent dead link ]
- ^ Tso, Ted (August 22, 2000). "Re: [PATCH] Re: Move of input drivers, some word needed from you". lkml.indiana.edu. Linux kernel mailing list. Retrieved August 22, 2014.
Jim Gettys has a wonderful explanation of this effect in the X server. It turns out that with branch predictions and the relative speed of CPU vs. memory changing over the past decade, loop unrolling is pretty much pointless. In fact, by eliminating all instances of Duff's Device from the XFree86 4.0 server, the server shrunk in size by _half_ _a_ _megabyte_ (!!!), and was faster to boot, because the elimination of all that excess code meant that the X server wasn't thrashing the cache lines as much.
- ^ "memcpy - cppreference.com". En.cppreference.com. Retrieved 2014-03-06.
- ^ Wall, Mike (2002-03-19). "Using Block Prefetch for Optimized Memory Performance" (PDF). mit.edu. Archived from the original (PDF) on 2017-08-30. Retrieved 2012-09-22.
- ^ Fog, Agner (2012-02-29). "Optimizing subroutines in assembly language" (PDF). Copenhagen University College of Engineering. pp. 100 ff. Retrieved 2012-09-22.
Further reading
[edit]- Kernighan, Brian; Ritchie, Dennis (March 1988). The C Programming Language (2nd ed.). Englewood Cliffs, N.J.: Prentice Hall. ISBN 0-13-110362-8.
External links
[edit]- Description and original mail by Duff at Lysator
- Simon Tatham's coroutines in C utilizes the same switch/case trick
- Adam Dunkels's Protothreads - Lightweight, Stackless Threads in C also uses nested switch/case statements (see also The lightest lightweight threads, Protothreads)
- Pigeon's device is a related technique: intertwined switch/case and if/else statements