{{For|hardware accelerators in startup companies|Seed accelerator}}
{{short description|Use of specialized computer hardware to perform some functions more efficiently than is possible in software running on a more general-purpose CPU}}

{{Multiple issues|
{{More citations needed|date=September 2014}}
{{Overlinked|date=May 2021}}
}}
[[File:Sun-crypto-accelerator-1000.jpg|thumb|A [[cryptographic accelerator]] card allows cryptographic operations to be performed at a faster rate.]]

'''Hardware acceleration''' is the use of [[computer hardware]] designed to perform specific functions more efficiently when compared to [[software]] running on a general-purpose [[central processing unit]] (CPU). Any [[function (mathematics)|transformation]] of [[data (computing)|data]] that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

To perform computing tasks more efficiently, generally one can invest time and money in improving the software, improving the hardware, or both. There are various approaches with advantages and disadvantages in terms of decreased [[Latency (engineering)|latency]], increased [[Throughput#Integrated Circuits|throughput]], and reduced [[Performance per watt|energy consumption]]. Typical advantages of focusing on software may include greater versatility, more rapid [[software development process|development]], lower [[non-recurring engineering]] costs, heightened [[software portability|portability]], and ease of [[Software release life cycle|updating features]] or [[patch (computing)|patch]]ing [[software bug|bugs]], at the cost of [[overhead (computing)|overhead]] to compute general operations. Advantages of focusing on hardware may include [[speedup]], reduced [[Electric energy consumption|power consumption]],<ref>{{cite magazine|url=https://www.wired.com/2014/06/microsoft-fpga/|title=Microsoft Supercharges Bing Search With Programmable Chips|date=16 June 2014|magazine=WIRED}}</ref> lower latency, increased [[Parallel computing|parallelism]]<ref>{{cite web|url=http://www.embedded.com/columns/showArticle.jhtml?articleID=192700615|title=Embedded|archive-url=https://web.archive.org/web/20071008163016/http://www.embedded.com/columns/showArticle.jhtml?articleID=192700615|archive-date=2007-10-08|url-status=dead|access-date=2012-08-18}} "FPGA Architectures from 'A' to 'Z'" by Clive Maxfield 2006</ref> and [[Bandwidth (computing)|bandwidth]], and [[Circuit underutilization|better utilization]] of area and [[Execution unit|functional components]] available on an [[integrated circuit]]; at the cost of lower ability to update designs once [[Semiconductor device fabrication|etched onto silicon]] and higher costs of [[functional verification]], times to market, and the need for more parts. In the hierarchy of digital computing systems ranging from general-purpose processors to [[Full custom|fully customized]] hardware, there is a tradeoff between flexibility and efficiency, with efficiency increasing by [[Computer performance by orders of magnitude|orders of magnitude]] when any given application is implemented higher up that hierarchy.<ref>{{Cite book|last1=Sinan|first1=Kufeoglu|last2=Mahmut|first2=Ozkuran|date=2019|title=Energy Consumption of Bitcoin Mining|chapter=Figure 5. CPU, GPU, FPGA, and ASIC minimum energy consumption between difficulty recalculation.|chapter-url=https://www.researchgate.net/figure/CPU-GPU-FPGA-and-ASIC-minimum-energy-consumption-between-difficulty-recalculation_fig5_337886683|doi=10.17863/CAM.41230|doi-access=free}}</ref> This hierarchy includes general-purpose processors such as CPUs,<ref>{{Cite journal|last1=Kim|first1=Yeongmin|last2=Kong|first2=Joonho|last3=Munir|first3=Arslan|date=2020|title=CPU-Accelerator Co-Scheduling for CNN Acceleration at the Edge|journal=IEEE Access|volume=8|pages=211422–211433|doi=10.1109/ACCESS.2020.3039278|issn=2169-3536|doi-access=free|bibcode=2020IEEEA...8u1422K }}</ref> more specialized processors such as programmable [[shader]]s in a [[Graphics processing unit|GPU]],<ref>{{Cite journal|last1=Lin|first1=Yibo|last2=Jiang|first2=Zixuan|last3=Gu|first3=Jiaqi|last4=Li|first4=Wuxi|last5=Dhar|first5=Shounak|last6=Ren|first6=Haoxing|last7=Khailany|first7=Brucek|last8=Pan|first8=David Z.|date=April 2021|title=DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement|url=https://ieeexplore.ieee.org/document/9122053|journal=IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems|volume=40|issue=4|pages=748–761|doi=10.1109/TCAD.2020.3003843|s2cid=225744481|issn=1937-4151}}</ref> [[fixed-function]] implemented on [[field-programmable gate array]]s (FPGAs),<ref>{{Cite journal|last1=Lyakhov|first1=Pavel|last2=Valueva|first2=Maria|last3=Valuev|first3=Georgii|last4=Nagornov|first4=Nikolai|date=2020-12-18|title=A Method of Increasing Digital Filter Performance Based on Truncated Multiply-Accumulate Units|journal=Applied Sciences|language=en|volume=10|issue=24|pages=9052|doi=10.3390/app10249052|issn=2076-3417|quote=Hardware simulation on FPGA increased the digital filter performance.|doi-access=free}}</ref> and fixed-function implemented on [[application-specific integrated circuit]]s (ASICs).<ref>{{Cite book|last1=Mohan|first1=Prashanth|last2=Wang|first2=Wen|last3=Jungk|first3=Bernhard|last4=Niederhagen|first4=Ruben|last5=Szefer|first5=Jakub|last6=Mai|first6=Ken|title=2020 IEEE 38th International Conference on Computer Design (ICCD) |chapter=ASIC Accelerator in 28 nm for the Post-Quantum Digital Signature Scheme XMSS |date=October 2020|chapter-url=https://ieeexplore.ieee.org/document/9283605|location=Hartford, CT, USA|publisher=IEEE|pages=656–662|doi=10.1109/ICCD50377.2020.00112|isbn=978-1-7281-9710-4|s2cid=229330964}}</ref>


Hardware acceleration is advantageous for [[Computer performance|performance]], and practical when the functions are fixed, so updates are not as needed as in software solutions. With the advent of [[Reconfigurable computing|reprogrammable]] [[Programmable logic device|logic devices]] such as FPGAs, the restriction of hardware acceleration to fully fixed algorithms has eased since 2010, allowing hardware acceleration to be applied to problem domains requiring modification to algorithms and processing [[control flow]].<ref name="BingFPGA">{{cite news|url=https://www.enterprisetech.com/2014/09/03/microsoft-using-fpgas-speed-bing-search/|title=How Microsoft Is Using FPGAs To Speed Up Bing Search|last1=Morgan|first1=Timothy Pricket|date=2014-09-03|access-date=2018-09-18|publisher=Enterprise Tech}}</ref><ref name="ProjCatapult">{{cite web|url=https://www.microsoft.com/en-us/research/project/project-catapult/|title=Project Catapult|website=Microsoft Research}}</ref> A disadvantage, however, is that hardware acceleration often depends on proprietary libraries that not all vendors are willing to distribute or expose, which makes it difficult to integrate into open-source projects.


==Overview==
[[Integrated circuit|Integrated circuits]] are designed to handle various operations on both analog and digital signals. In computing, digital signals are the most common and are typically represented as binary numbers. [[Computer hardware]] and software use this [[binary representation]] to perform computations. This is done by processing [[Boolean function|Boolean functions]] on the binary input, and then outputting the results for storage or further processing by other devices.
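
As a minimal illustration (a C++ sketch written for this example, not taken from any particular system), a one-bit [[Adder (electronics)|full adder]] is simply a pair of Boolean functions of its three input bits, the same functions a logic circuit realizes with XOR, AND, and OR gates:
<syntaxhighlight lang="c++">
#include <cstdio>

// One bit of binary addition expressed as Boolean functions of the
// input bits -- the same functions a full-adder circuit realizes
// with XOR, AND, and OR gates.
void full_adder(bool a, bool b, bool carry_in, bool& sum, bool& carry_out) {
    sum = a ^ b ^ carry_in;                      // two XOR gates
    carry_out = (a & b) | ((a ^ b) & carry_in);  // AND and OR gates
}

int main() {
    bool sum, carry_out;
    full_adder(true, true, false, sum, carry_out);     // 1 + 1 = binary 10
    std::printf("sum=%d carry=%d\n", sum, carry_out);  // prints sum=0 carry=1
}
</syntaxhighlight>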


===Computational equivalence of hardware and software===
Because all [[Turing machine|Turing machines]] can run any [[computable function]], it is always possible to design custom hardware that performs the same function as a given piece of software. Conversely, software can always be used to emulate the function of a given piece of hardware. Custom hardware may offer higher performance per watt for the same functions that can be specified in software. [[Hardware description language]]s (HDLs) such as [[Verilog]] and [[VHDL]] can model the same [[Semantics (computer science)|semantics]] as software and [[Logic synthesis|synthesize]] the design into a [[netlist]] that can be programmed to an FPGA or composed into the [[logic gate]]s of an ASIC.
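
Software emulation of hardware can be as direct as evaluating a circuit's netlist gate by gate. The following C++ sketch (the netlist encoding and names are illustrative assumptions, not any HDL tool's format) emulates the full adder from the previous section as an explicit netlist:
<syntaxhighlight lang="c++">
#include <cstdio>
#include <vector>

// Software emulating hardware: a gate-level netlist for a full adder,
// evaluated wire by wire. The encoding is an illustrative assumption.
enum class Gate { AND, OR, XOR };

struct Node {
    Gate op;
    int in0, in1;  // indices into the wire vector
};

int main() {
    // Wires 0..2 hold the primary inputs a=1, b=1, carry_in=0;
    // evaluating each gate appends one new wire.
    std::vector<bool> wire = {true, true, false};
    const std::vector<Node> netlist = {
        {Gate::XOR, 0, 1},  // wire 3 = a XOR b
        {Gate::XOR, 3, 2},  // wire 4 = sum
        {Gate::AND, 0, 1},  // wire 5 = a AND b
        {Gate::AND, 3, 2},  // wire 6 = (a XOR b) AND carry_in
        {Gate::OR,  5, 6},  // wire 7 = carry_out
    };
    for (const Node& g : netlist) {
        bool a = wire[g.in0], b = wire[g.in1];
        wire.push_back(g.op == Gate::AND ? (a && b)
                     : g.op == Gate::OR  ? (a || b)
                                         : (a != b));
    }
    std::printf("sum=%d carry=%d\n", (int)wire[4], (int)wire[7]);  // sum=0 carry=1
}
</syntaxhighlight>
An HDL synthesizer performs the reverse mapping, turning a functional description into physical gates.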


===Stored-program computers===
The vast majority of software-based computing occurs on machines implementing the [[von Neumann architecture]], collectively known as [[stored-program computer]]s. [[Computer program]]s are stored as data and [[execution (computing)|executed]] by [[processor (computing)|processor]]s. Such processors must fetch and decode instructions, as well as [[Load–store architecture|load data operands]] from [[computer memory|memory]] (as part of the [[instruction cycle]]), to execute the instructions constituting the software program. Relying on a common [[CPU cache|cache]] for code and data leads to the "von Neumann bottleneck", a fundamental limitation on the throughput of software on processors implementing the von Neumann architecture. Even in the [[modified Harvard architecture]], where instructions and data have separate caches in the [[memory hierarchy]], there is overhead to decoding instruction [[opcode]]s and [[Multiplexer|multiplexing]] available [[execution unit]]s on a [[microprocessor]] or [[microcontroller]], leading to low circuit utilization. Modern processors that provide [[simultaneous multithreading]] exploit under-utilization of available processor functional units and [[Instruction-level parallelism|instruction level parallelism]] between different hardware threads.
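
The per-instruction overhead can be made concrete with a toy interpreter. In the following C++ sketch (the three-instruction machine is hypothetical, for illustration only), every useful addition is preceded by fetch and decode work that a fixed-function adder circuit would not perform:
<syntaxhighlight lang="c++">
#include <cstddef>
#include <cstdio>
#include <vector>

// A toy stored-program machine. Each loop iteration fetches an
// instruction and decodes its opcode before any useful arithmetic
// happens -- overhead a fixed-function circuit would not pay.
enum Op { LOAD_IMM, ADD, HALT };

struct Insn {
    Op op;
    int reg;      // destination register
    int operand;  // immediate value for LOAD_IMM, source register for ADD
};

int main() {
    const std::vector<Insn> program = {
        {LOAD_IMM, 0, 2},  // r0 = 2
        {LOAD_IMM, 1, 3},  // r1 = 3
        {ADD, 0, 1},       // r0 = r0 + r1
        {HALT, 0, 0},
    };
    int regs[2] = {0, 0};
    std::size_t pc = 0;
    while (true) {
        const Insn insn = program[pc++];  // fetch
        switch (insn.op) {                // decode
            case LOAD_IMM: regs[insn.reg] = insn.operand; break;         // execute
            case ADD:      regs[insn.reg] += regs[insn.operand]; break;
            case HALT:     std::printf("r0 = %d\n", regs[0]); return 0;  // r0 = 5
        }
    }
}
</syntaxhighlight>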


===Hardware execution units===
Hardware execution units do not in general rely on the von Neumann or modified Harvard architectures and do not need to perform the instruction fetch and decode steps of an [[instruction cycle]] and incur those stages' overhead. If needed calculations are specified in a [[Register-transfer level|register transfer level]] (RTL) hardware design, the time and circuit area costs that would be incurred by instruction fetch and decoding stages can be reclaimed and put to other uses.


This reclamation saves time, power, and circuit area in computation. The reclaimed resources can be used for increased parallel computation, other functions, communication, or memory, as well as increased [[input/output]] capabilities. This comes at the cost of general-purpose utility.


===Emerging hardware architectures===
Greater RTL customization of hardware designs allows emerging architectures such as [[in-memory processing|in-memory computing]], [[transport triggered architecture]]s (TTA) and [[network on a chip|networks-on-chip]] (NoC) to further benefit from increased [[locality of reference|locality]] of data to execution context, thereby reducing computing and communication latency between modules and functional units.


Custom hardware is limited in parallel processing capability only by the area and [[logic block]]s available on the [[Die (integrated circuit)|integrated circuit die]].<ref>[http://www.xilinx.com/products/design_resources/proc_central/microblaze_faq.pdf MicroBlaze Soft Processor: Frequently Asked Questions] {{webarchive|url=https://web.archive.org/web/20111027074459/http://www.xilinx.com/products/design_resources/proc_central/microblaze_faq.pdf|date=2011-10-27}}</ref> Therefore, hardware is much more free to offer [[Massively parallel|massive parallelism]] than software on general-purpose processors, offering a possibility of implementing the [[Parallel RAM|parallel random-access machine]] (PRAM) model.


It is common to build [[Multi-core processor|multicore]] and [[Manycore processor|manycore]] processing units out of [[Soft microprocessor|microprocessor IP core schematics]] on a single FPGA or ASIC.<ref>{{cite book | chapter-url=https://doi.org/10.1007%2FBFb0055278 | doi=10.1007/BFb0055278 | chapter=Implementing processor arrays on FPGAs | title=Field-Programmable Logic and Applications from FPGAs to Computing Paradigm | series=Lecture Notes in Computer Science | year=1998 | last1=Vassányi | first1=István | volume=1482 | pages=446–450 | isbn=978-3-540-64948-9 }}</ref><ref>Zhoukun WANG and Omar HAMMAMI. "A 24 Processors System on Chip FPGA Design with Network on Chip". [http://www.design-reuse.com/articles/21583/processor-noc-fpga.html]</ref><ref>[http://members.optusnet.com.au/jekent/Micro16Array/index.html John Kent. "Micro16 Array - A Simple CPU Array"]</ref><ref>Kit Eaton. "1,000 Core CPU Achieved: Your Future Desktop Will Be a Supercomputer". 2011. [http://www.fastcompany.com/1714174/1000-core-cpu-achieved-your-future-desktop-will-be-a-supercomputer?partner=rss]</ref><ref>"Scientists Squeeze Over 1,000 Cores onto One Chip". 2011. [http://www.ecnmag.com/news/2011/01/research/Over-1000-Cores-on-One-Chip.aspx] {{Webarchive|url=https://web.archive.org/web/20120305082424/http://www.ecnmag.com/news/2011/01/research/Over-1000-Cores-on-One-Chip.aspx|date=2012-03-05}}</ref> Similarly, specialized functional units can be composed in parallel, as [[Parallel processing (DSP implementation)|in digital signal processing]], without being embedded in a processor [[Semiconductor intellectual property core|IP core]]. Therefore, hardware acceleration is often employed for repetitive, fixed tasks involving little [[Conditional (computer programming)|conditional branching]], especially on large amounts of data. This is how [[Nvidia]]'s [[CUDA]] line of GPUs are implemented.


===Implementation metrics===
As device mobility has increased, new metrics have been developed that measure the relative performance of specific acceleration protocols, considering characteristics such as physical hardware dimensions, power consumption, and operations throughput. These can be summarized into three categories: task efficiency, implementation efficiency, and flexibility. Appropriate metrics consider the area of the hardware along with both the corresponding operations throughput and energy consumed.<ref>{{Cite journal|last1=Kienle|first1=Frank|last2=Wehn|first2=Norbert|last3=Meyr|first3=Heinrich|date=December 2011|title=On Complexity, Energy- and Implementation-Efficiency of Channel Decoders|journal=IEEE Transactions on Communications|language=en-US|volume=59|issue=12|pages=3301–3310|doi=10.1109/tcomm.2011.092011.100157|issn=0090-6778|arxiv=1003.3792|s2cid=13863870}}</ref>
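
A simple composite figure of merit (an illustrative formulation, not one of the specific metrics defined in the cited survey) normalizes an implementation's operations throughput by the silicon area and power it consumes:
<math display="block">\text{efficiency} = \frac{\text{throughput}\ [\text{operations/s}]}{\text{area}\ [\text{mm}^2] \times \text{power}\ [\text{W}]}</math>
Under such a measure, a fixed-function circuit and a software implementation of the same function can be compared directly.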

==Example tasks accelerated==

===Summing two arrays into a third array===
<syntaxhighlight lang="c">
#include <stdio.h>

int main(void)
{

int arrayOne[] = {1, 2, 3};
int arrayTwo[] = {4, 5, 6};
int arraySum[3];

for (int i = 0; i < 3; i++)
{
arraySum[i] = arrayOne[i] + arrayTwo[i];
}
}
</syntaxhighlight>

===Summing one million integers===
Suppose we wish to compute the sum of <math>2^{20} = 1,048,576</math> [[Integer (computer science)|integers]]. Assuming [[Arbitrary-precision arithmetic|arbitrary-precision integers]] are available as a <code>bignum</code> type [[Precision (computer science)|large enough]] to hold the sum, this can be done in software by specifying (here, in [[C++]]):
<syntaxhighlight lang="c++">
constexpr int N = 20;
constexpr int two_to_the_N = 1 << N;

bignum array_sum(const std::array<int, two_to_the_N>& ints) {
bignum result = 0;
for (std::size_t i = 0; i < two_to_the_N; i++) {
result += ints[i];
}
return result;
}
</syntaxhighlight>

This algorithm runs in [[linear time]], <math display="inline">\mathcal{O}\left(n\right)</math> in [[Big O notation]]. In hardware, with sufficient area on [[Integrated circuit|chip]], calculation can be parallelized to take only 20 [[Clock cycle|time steps]] using the [[prefix sum]] algorithm.<ref name="hs1986">{{cite journal|last1=Hillis|first1=W. Daniel|last2=Steele, Jr.|first2=Guy L.|date=December 1986|title=Data parallel algorithms|journal=Communications of the ACM|volume=29|issue=12|pages=1170–1183|doi=10.1145/7902.7903}}</ref> The algorithm requires only [[logarithmic time]], <math display="inline">\mathcal{O}\left(\log{n}\right)</math>, and <math display="inline">\mathcal{O}\left(1\right)</math> [[Space complexity|space]] as an [[in-place algorithm]]:
<syntaxhighlight lang="systemverilog">
parameter int N = 20;
parameter int two_to_the_N = 1 << N;

function int array_sum;
input int array[two_to_the_N];
begin
for (genvar i = 0; i < N; i++) begin
for (genvar j = 0; j < two_to_the_N; j++) begin
if (j >= (1 << i)) begin
array[j] = array[j] + array[j - (1 << i)];
end
end
end
return array[two_to_the_N - 1];
end
endfunction
</syntaxhighlight>

This example takes advantage of parallel resources that are available in application-specific hardware but not in most software-programmable [[computing paradigm]]s and general-purpose [[computer architecture|architecture]]s.

===Stream processing===
{{Expand section|date=October 2018}}

Hardware acceleration can be applied to [[stream processing]].
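
As a minimal illustration, the following C++ sketch (a hypothetical three-tap moving-average kernel, written for this example) has the defining shape of a streaming computation: it consumes one input sample per step, keeps only a fixed, small state, and performs no random access.
<syntaxhighlight lang="c++">
#include <cstdio>

// Streaming 3-tap moving average: consumes one sample per step and
// keeps only the two previous samples as state. A hardware
// implementation pipelines this to produce one result per clock cycle.
class MovingAverage3 {
    int prev1 = 0, prev2 = 0;
public:
    int step(int sample) {
        int result = (sample + prev1 + prev2) / 3;
        prev2 = prev1;  // shift the two-element window
        prev1 = sample;
        return result;
    }
};

int main() {
    MovingAverage3 filter;
    const int stream[] = {3, 6, 9, 12, 15};
    for (int sample : stream)
        std::printf("%d ", filter.step(sample));  // prints: 1 3 6 9 12
}
</syntaxhighlight>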


==Applications==
Examples of hardware acceleration include [[bit blit]] acceleration functionality in graphics processing units (GPUs), use of [[memristor]]s for accelerating [[Artificial neural network|neural networks]], and [[regular expression]] hardware acceleration for [[anti-spam techniques|spam control]] in the [[server (computing)|server]] industry, intended to prevent [[ReDoS|regular expression denial of service]] (ReDoS) attacks.<ref name="wellho">{{cite web|title=Regular Expressions in hardware|url=http://www.wellho.net/regex/hardware.html|access-date=17 July 2014}}</ref> The hardware that performs the acceleration may be part of a general-purpose CPU, or a separate unit called a hardware accelerator, though they are usually referred to with a more specific term, such as 3D accelerator, or [[cryptographic accelerator]].


Traditionally, processors were sequential (instructions are executed one by one), and were designed to run general purpose algorithms controlled by [[instruction fetch]] (for example, moving temporary results [[Load/store architecture|to and from]] a [[register file]]). Hardware accelerators improve the execution of a specific algorithm by allowing greater [[concurrency (computer science)|concurrency]], having specific [[datapath]]s for their [[temporary variable]]s, and reducing the overhead of instruction control in the fetch-decode-execute cycle.


Modern processors are [[Multi-core processor|multi-core]] and often feature parallel "single-instruction; multiple data" ([[Single instruction, multiple data|SIMD]]) units. Even so, hardware acceleration still yields benefits. Hardware acceleration is suitable for any computation-intensive algorithm which is executed frequently in a task or program. Depending upon the granularity, hardware acceleration can vary from a small functional unit, to a large functional block (like [[motion estimation]] in [[MPEG-2]]).
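
For instance, the inner kernel of motion estimation is a sum of absolute differences (SAD) over pixel blocks. The following C++ sketch (the 8×8 block size and names are illustrative, not a particular codec's implementation) shows the branch-free, uniform loop that SIMD units and dedicated blocks accelerate:
<syntaxhighlight lang="c++">
#include <cstdio>
#include <cstdlib>

// Sum of absolute differences between two 8x8 pixel blocks, the inner
// kernel of block-based motion estimation. The loop body is branch-free
// and identical for every pixel, so it maps directly onto SIMD lanes
// or a fixed-function hardware block.
int sad_8x8(const unsigned char a[64], const unsigned char b[64]) {
    int sum = 0;
    for (int i = 0; i < 64; i++)
        sum += std::abs(a[i] - b[i]);
    return sum;
}

int main() {
    unsigned char cur[64], ref[64];
    for (int i = 0; i < 64; i++) { cur[i] = i; ref[i] = i ^ 1; }
    std::printf("SAD = %d\n", sad_8x8(cur, ref));  // 64: each pair differs by 1
}
</syntaxhighlight>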


==Hardware acceleration units by application==
{| class="wikitable"
{| class="wikitable sortable mw-collapsible mw-collapsed"
|+
|+
! Application
! Application
! Hardware accelerator
! Hardware accelerator
! Acronym
! Acronym
|-
|
|
|
|-
|-
| [[Computer graphics]]
| [[Computer graphics]]
* General-purpose tasks
* General-purpose computing
* [[Nvidia]] graphics cards
* GP computing, on Nvidia graphics cards
* [[Ray tracing (graphics)|Ray tracing]]
* [[Ray tracing (graphics)|Ray tracing]]
* [[Video codec]]
| [[Graphics processing unit]]
| Graphics processing unit
* [[General-purpose computing on graphics processing units|General-purpose computing on GPU]]
* [[General-purpose computing on graphics processing units|General-purpose computing on GPU]]
* [[CUDA]] architecture
* [[CUDA]] architecture
* [[Ray-tracing hardware]]
* [[Ray-tracing hardware]]
* Various [[:Category:Video acceleration|video acceleration hardware]]

| GPU
| GPU


Line 130: Line 72:
* CUDA
* CUDA
* RTX
* RTX
* N/A
|-
|
|
|
|-
|-
| [[Digital signal processing]]
| [[Digital signal processing]]
Line 150: Line 97:
* [[System on a chip|on a chip]]
* [[System on a chip|on a chip]]
* [[Transmission Control Protocol|TCP]]
* [[Transmission Control Protocol|TCP]]
* [[Input/output]]
* Input/output
| [[Network processor]] and [[network interface controller]]
| [[Network processor]] and [[network interface controller]]
* [[Network on a chip]]
* [[Network on a chip]]
Line 165: Line 112:
** [[Instruction set architecture|ISA]]
** [[Instruction set architecture|ISA]]
** [[Transport Layer Security|SSL/TLS]]
** [[Transport Layer Security|SSL/TLS]]
* [[Cryptographic attack|Attack]]
* [[Cryptanalysis|Attack]]
* [[Random number generation]]
* [[Random number generation]]
| [[Cryptographic accelerator]] and [[secure cryptoprocessor]]
| [[Cryptographic accelerator]] and [[secure cryptoprocessor]]
* [[Hardware-based encryption]]
* [[Hardware-based encryption]]
** [[AES instruction set]]
** [[AES instruction set]]
** [[SSL acceleration]]
** [[TLS acceleration|SSL acceleration]]
* [[Custom hardware attack]]
* [[Custom hardware attack]]
* [[Hardware random number generator]]
* [[Hardware random number generator]]
Line 190: Line 137:
|-
|-
| [[Multilinear algebra]]
| [[Multilinear algebra]]
| [[Tensor processing unit]]
| [[Tensor Processing Unit|Tensor processing unit]]
| TPU
| TPU
|-
|-
Line 205: Line 152:
| N/A
| N/A
|-
|-
| [[In-memory processing]]
| In-memory processing
| [[Network on a chip]] and [[Systolic array]]
| Network on a chip and [[Systolic array]]
| NoC; N/A
| NoC; N/A
|-
|-
Line 214: Line 161:
|-
|-
| Any computing task
| Any computing task
| [[Computer hardware]]
| Computer hardware


* [[Field-programmable gate array]]s<ref name="Farabet">Farabet, Clément, et al. "[https://www.academia.edu/download/43417932/Hardware_Accelerated_Convolutional_Neura20160306-12343-1rpenxt.pdf Hardware accelerated convolutional neural networks for synthetic vision systems]." Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on. IEEE, 2010.</ref>
* Field-programmable gate arrays<ref name="Farabet">Farabet, Clément, et al. "[https://www.academia.edu/download/43417932/Hardware_Accelerated_Convolutional_Neura20160306-12343-1rpenxt.pdf Hardware accelerated convolutional neural networks for synthetic vision systems]{{dead link|date=July 2022|bot=medic}}{{cbignore|bot=medic}}." Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on. IEEE, 2010.</ref>
* [[Application-specific integrated circuit]]s<ref name="Farabet" />
* Application-specific integrated circuits<ref name="Farabet" />
* [[Complex programmable logic device]]s
* [[Complex programmable logic device]]s
* [[System on a chip|Systems-on-Chip]]
* Systems-on-Chip
** [[Multi-processor system-on-chip]]
** [[Multiprocessor system on a chip|Multi-processor system-on-chip]]
** [[Programmable system-on-chip]]
** [[Programmable system-on-chip]]
| HW (sometimes)
| HW (sometimes)
Line 241: Line 188:
* [[Soft microprocessor]]
* [[Soft microprocessor]]
* [[Flynn's taxonomy]] of parallel [[computer architecture]]s
* [[Flynn's taxonomy]] of parallel [[computer architecture]]s
** [[SIMD|Single instruction, multiple data]] (SIMD)
** [[Single instruction, multiple data]] (SIMD)
** [[Single instruction, multiple threads]] (SIMT)
** [[Single instruction, multiple threads]] (SIMT)
** [[MIMD|Multiple instructions, multiple data]] (MIMD)
** [[Multiple instruction, multiple data|Multiple instructions, multiple data]] (MIMD)
* [[Computer for operations with functions]]
* [[Computer for operations with functions]]


Line 254: Line 201:
{{Hardware acceleration}}
{{Graphics Processing Unit}}
{{Digital electronics}}


[[Category:Hardware acceleration| ]]
[[Category:Gate arrays]]
[[Category:Graphics hardware]]
[[Category:Articles with example C code]]
