Kepler (microarchitecture): Difference between revisions
Cretman121 (talk | contribs) No edit summary |
No edit summary Tags: Mobile edit Mobile web edit Advanced mobile edit |
||
(176 intermediate revisions by more than 100 users not shown) | |||
Line 1: | Line 1: | ||
{{short description|GPU microarchitecture by Nvidia}} |
|||
<!--- Don't mess with this line! --->{{Unreviewed}} |
|||
{{Infobox GPU microarchitecture |
|||
<!--- Nvidia Kepler Architecture. ---> |
|||
| name = Kepler |
|||
<!--- Write your article below this line ---> |
|||
| image = |
|||
Nvidia Kepler Architecture is the first Nvidia GPU architecture to focus on energy efficiency. |
|||
| caption = |
|||
| alt = |
|||
| launched = {{start date|2012|04|03}} |
|||
| discontinued = |
|||
| soldby = [[Nvidia]] |
|||
| designfirm = [[Nvidia]] |
|||
| manuf1 = [[TSMC]] |
|||
| process = [[TSMC]] [[32 nm process|28 nm]] |
|||
| codename = <!-- Official codename for the GPU microarchitecture --> |
|||
<!------------------ Product Series -------------------> |
|||
| products-desktop1 = [[GeForce 600 series]] <br /> [[GeForce 700 series]] |
|||
| products-hedt1 = [[Quadro|Quadro K]] |
|||
| products-server1 = [[Nvidia Tesla|Tesla K]] |
|||
<!------------------ Supported Graphics APIs -------------------> |
|||
| directx-version = [[DirectX#DirectX 12|DirectX 12 (Feature Level 11_0)]] |
|||
| direct3d-version = <!-- Version number of Direct3D supported by the GPU architecture --> |
|||
| shadermodel-version = [[High-Level Shader Language|Shader Model 6.5]] |
|||
| opencl-version = <!-- Version number of OpenCL supported by the GPU architecture --> |
|||
| opengl-version = <!-- Version number of OpenGL supported by the GPU architecture --> |
|||
| opengles-version = <!-- Version number of OpenGL-ES supported by the GPU architecture --> |
|||
| cuda-version = <!-- Version number of CUDA supported by the GPU architecture --> |
|||
| optix-version = <!-- Version number of OptiX supported by the GPU architecture --> |
|||
| mantle-api = <!-- Version number of Mantle supported by the GPU architecture --> |
|||
| vulkan-api = [[Vulkan#Vulkan 1.2|Vulkan 1.2]] |
|||
<!------------------ Supported Compute APIs -------------------> |
|||
| opengl-compute-version = <!-- Version number of OpenGL Compute supported by the GPU architecture --> |
|||
| cuda-compute-version = <!-- Version number of CUDA Compute supported by the GPU architecture --> |
|||
| directcompute-version = <!-- Version number of DirectCompute supported by the GPU architecture --> |
|||
<!------------------ Specifications -------------------> |
|||
| compute = <!-- Peak compute level in TFLOPS --> |
|||
| slowest = <!-- Base clock rate number --> |
|||
| slow-unit = <!-- Base clock rate unit, e.g. MHz or GHz --> |
|||
| fastest = <!-- Peak clock rate number --> |
|||
| fast-unit = <!-- Peak clock rate unit, e.g. MHz or GHz --> |
|||
| shader-clock = <!-- Clock rate that the shader engine operates at --> |
|||
| l0-cache = <!-- Amount of L0 cache (per SM/compute unit/execution unit) --> |
|||
| l1-cache = 16{{nbsp}}KB (per SM) |
|||
| l2-cache = Up to 512{{nbsp}}KB |
|||
| l3-cache = <!-- Amount of L3 cache --> |
|||
| memory-support = [[GDDR5 SDRAM|GDDR5]] |
|||
| memory-clock = <!-- Clock rate for GPU memory --> |
|||
| pcie-support = [[PCI Express#PCI Express 2.0|PCIe 2.0]] <br /> [[PCI Express#PCI Express 3.0|PCIe 3.0]] |
|||
<!------------------ Media Engine -------------------> |
|||
| encode-codec = [[Advanced Video Coding|H.264]] |
|||
| decode-codec = {{hlist|[[Advanced Video Coding|H.264]]|[[High Efficiency Video Coding|H.265]]}} |
|||
| color-depth = <!-- Supported color depth for encoding, e.g. 8-bit, 10-bit, 12-bit --> |
|||
| encoders = [[NVENC]] |
|||
| display-outputs = [[Digital Visual Interface|DVI]] <br /> [[DisplayPort#1.2|DisplayPort 1.2]] <br /> [[HDMI#Version 1.4|HDMI 1.4a]] |
|||
<!------------------ History -------------------> |
|||
| predecessor = [[Fermi (microarchitecture)|Fermi]] |
|||
| variant = <!-- Variant of the GPU architecture --> |
|||
| successor = [[Maxwell (microarchitecture)|Maxwell]] |
|||
| support status = Unsupported |
|||
}} |
|||
[[File:JKepler.jpg|right|thumb|Portrait of Johannes Kepler, eponym of the architecture]] |
|||
'''Kepler''' is the codename for a [[GPU]] [[microarchitecture]] developed by [[Nvidia]], first introduced at retail in April 2012,<ref>{{cite web |last1=Mujtaba |first1=Hassan |date=18 February 2012 |title=Nvidia Expected to launch Eight New 28nm Kepler GPU's in April 2012 |url=http://wccftech.com/nvidia-expected-launch-28nm-kepler-gpus-april-2012/}}</ref> as the successor to the [[Fermi (microarchitecture)|Fermi]] microarchitecture. Kepler was Nvidia's first microarchitecture to focus on [[Efficient energy use|energy efficiency]]. Most [[GeForce 600 series]], most [[GeForce 700 series]], and some [[GeForce 800M series]] GPUs were based on Kepler, all manufactured in 28 nm. Kepler found use in the GK20A, the GPU component of the [[Tegra K1]] [[System on a chip|SoC]], and in the [[Nvidia Quadro|Quadro]] Kxxx series, the Quadro NVS 510, and [[Nvidia Tesla|Tesla]] computing modules. |
|||
Kepler was followed by the [[Maxwell (microarchitecture)|Maxwell]] microarchitecture and used alongside Maxwell in the [[GeForce 700 series]] and [[GeForce 800M series]]. |
|||
The architecture is named after [[Johannes Kepler]], a German mathematician and key figure in the 17th-century [[scientific revolution]]. |
|||
== Overview == |
||
[[File:NVIDIA@28nm@Kepler@GK110 A1@GeForce GTX Titan@1251A1 NFF528.MOW GK110-400-A1 Stack-DSC04727-DSC04758 - ZS-retouched-1 (26914831573).jpg|thumb|Die shot of a GK110 A1 GPU, found inside GeForce GTX Titan cards]] |
|||
Where the goal of the previous architecture, Fermi, was to increase raw performance (particularly for compute and tessellation), Nvidia's goal with the Kepler architecture was to increase performance per watt, while still striving for overall performance increases.<ref name=gtx680-nvidia-paper /> The primary way they achieved this goal was through the use of a unified clock. By abandoning the shader clock found in their previous GPU designs, efficiency is increased, even though it requires more cores to achieve similar levels of performance. This is not only because the cores are more power efficient (two Kepler cores using about 90% of the power of one Fermi core, according to Nvidia's numbers), but also because the reduction in clock speed delivers a 50% reduction in power consumption in that area.<ref name="anandtech-GTX680-review">{{cite web|last=Smith|first=Ryan|title=NVIDIA GeForce GTX 680 Review: Retaking The Performance Crown|date=March 22, 2012|url=http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3|work=AnandTech|accessdate=November 25, 2012}}</ref> |
|||
Nvidia's previous architecture was designed with a focus on increasing compute and tessellation performance. With the Kepler architecture, Nvidia shifted its focus to efficiency, programmability, and performance.<ref>{{cite web |title=Inside Kepler |url=http://on-demand.gputechconf.com/gtc/2012/presentations/S0642-GTC2012-Inside-Kepler.pdf |language=en-US |access-date=2015-09-19}}</ref><ref name=gtx680-nvidia-paper>{{cite web |title=Introducing The GeForce GTX 680 GPU |url=http://www.geforce.com/whats-new/articles/introducing-the-geforce-gtx-680-gpu/#kepler-architecture |website=Nvidia |language=en-US |date=March 22, 2012 |access-date=2015-09-19}}</ref> The efficiency aim was achieved through the use of a unified GPU clock, simplified static scheduling of instructions, and a higher emphasis on performance per watt.<ref>{{Cite web |title=Nvidia's Next Generation CUDA Compute Architecture: Kepler TM GK110 |url=https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf |website=Nvidia |language=en-US}}</ref> Abandoning the shader clock found in previous GPU designs increases efficiency, even though additional cores are then required to achieve higher levels of performance. This is not only because the cores are more power-efficient (two Kepler cores use about 90% of the power of one Fermi core, according to Nvidia's numbers), but also because the change to a unified GPU clock scheme delivers a 50% reduction in power consumption in that area.<ref name="anandtech-GTX680-review">{{cite web |last=Smith |first=Ryan |date=March 22, 2012 |title=Nvidia GeForce GTX 680 Review: Retaking The Performance Crown |url=http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3 |work=AnandTech |language=en-US |access-date=November 25, 2012}}</ref> |
|||
The programmability aim was achieved with Kepler's Hyper-Q, Dynamic Parallelism, and several new Compute Capability 3.x features. Together, these enabled higher GPU utilization and simpler code management on GK GPUs, giving programmers more flexibility when targeting Kepler.<ref>{{cite web |title=Efficiency Through Hyper-Q, Dynamic Parallelism, & More |url=http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last/4 |website=AnandTech |language=en-US |date=November 12, 2012 |access-date=2015-09-19}}</ref> |
|||
Kepler also introduced a new form of texture handling known as bindless textures. Previously, textures needed to be bound by the CPU to a particular slot in a fixed-size table before the GPU could reference them. This led to two limitations: one was that because the table was fixed in size, there could only be as many textures in use at one time as could fit in this table (128). The second was that the CPU was doing unnecessary work: it had to load each texture, and also bind each texture loaded in memory to a slot in the binding table.<ref name=gtx680-nvidia-paper /> With bindless textures, both limitations are removed. The GPU can access any texture loaded into memory, increasing the number of available textures and removing the performance penalty of binding. |
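The difference between slot-based binding and bindless access can be illustrated with a small toy model. This is only a sketch in Python, not a real graphics API: the class names `SlotBoundTextures` and `BindlessTextures` and their methods are hypothetical, and only the 128-slot table size comes from the text above.

```python
# Toy model only: these classes are hypothetical and do not mirror any real graphics API.
MAX_BOUND_SLOTS = 128  # fixed binding-table size cited in the text


class SlotBoundTextures:
    """Pre-Kepler style: the CPU binds each texture to a slot before the GPU can use it."""

    def __init__(self):
        self.slots = [None] * MAX_BOUND_SLOTS
        self.bind_calls = 0  # counts the extra CPU work spent on binding

    def bind(self, slot, texture):
        if not 0 <= slot < MAX_BOUND_SLOTS:
            raise IndexError("slot is outside the fixed-size binding table")
        self.slots[slot] = texture
        self.bind_calls += 1

    def sample(self, slot):
        return self.slots[slot]


class BindlessTextures:
    """Bindless style: any texture loaded into memory is reachable through its handle."""

    def __init__(self):
        self.memory = {}
        self.next_handle = 0

    def load(self, texture):
        handle = self.next_handle
        self.memory[handle] = texture
        self.next_handle += 1
        return handle  # no separate binding step is needed

    def sample(self, handle):
        return self.memory[handle]
```

In this model the slot-based table rejects a 129th texture and pays one `bind` call per texture, while the bindless variant is limited only by how many textures have been loaded.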
|||
Finally, the performance aim was met through additional execution resources (more CUDA cores, registers, and cache) and Kepler's ability to reach a memory clock speed of 7 GHz, both of which increase Kepler's performance compared with previous Nvidia GPUs.<ref name=anandtech-GTX680-review /><ref>{{Cite web |title=GeForce GTX 770 {{!}} Specifications {{!}} GeForce |url=https://www.nvidia.com/en-us/geforce/graphics-cards/geforce-gtx-770/specifications/ |website=Nvidia |access-date=2022-06-07}}</ref> |
|||
Finally, with Kepler, Nvidia was able to increase the memory clock to 6 GHz. To accomplish this, they needed to design an entirely new memory controller and bus. While still shy of the theoretical 7 GHz limitation of GDDR5, this is well above the 4 GHz speed of the memory controller for Fermi.<ref name=anandtech-GTX680-review /> |
|||
== Features ==
GK-series GPUs contain features from both the older Fermi and newer Kepler generations. Kepler-based members add the following standard features:
* [[PCI Express#PCI Express 3.0|PCI Express 3.0]] interface
* [[DisplayPort]] 1.2
* [[HDMI]] 1.4a 4K x 2K video output
* [[Purevideo|PureVideo VP5]] hardware video acceleration (up to 4K x 2K H.264 decode)
* Hardware [[H.265]] decoding<ref>{{cite web | url=https://bluesky-soft.com/en/dxvac/deviceInfo/decoder/nvidia.html | title=NVIDIA GPU Decoder Device Information }}</ref>
* Hardware [[H.264]] encoding acceleration block (NVENC)
* Support for up to 4 independent 2D displays, or 3 stereoscopic/3D displays (NV Surround)
* Next Generation Streaming Multiprocessor (SMX)
* Polymorph-Engine 2.0
* A New Instruction Scheduler
* Simplified Instruction Scheduler
* Bindless Textures
* [[CUDA]] Compute Capability 3.0 to 3.5
* GPU Boost (Upgraded to 2.0 on GK110)
* TXAA Support
* Manufactured by [[TSMC]] on a 28 nm process
* New Shuffle Instructions
* Dynamic Parallelism
* Hyper-Q (Hyper-Q's MPI functionality reserved for Tesla only)
* Grid Management Unit
* Nvidia GPUDirect (GPUDirect's RDMA functionality reserved for Tesla only)
|||
=== {{Anchor|SMX}}Next Generation Streaming Multiprocessor (SMX) === |
||
[[File:NVIDIA GeForce GTX 780 PCB-Front.jpg|thumb|GTX 780 PCB and die - A later revision of Kepler with more similarities to the GK110 than the initial 680.]] |
|||
Kepler employs a new streaming multiprocessor architecture called SMX. CUDA execution core counts were increased from 32 per each of 16 SMs to 192 per each of 8 SMX; the register file was only doubled per SMX to 65,536 x 32-bit for an overall lower ratio; between this and other compromises, despite the 3x overall increase in CUDA cores and clock increase (on the 680 vs. the Fermi 580), the actual performance gains in most operations were well under 3x. Dedicated FP64 CUDA cores are used rather than treating two FP32 cores as a single unit as was done previously, and very few were included on the consumer models resulting in 1/24th speed FP64 calculation compared to FP32.<ref>{{cite web |title=GeForce 680 (Kepler) Whitepaper |url=https://www.nvidia.com/content/pdf/product-specifications/geforce_gtx_680_whitepaper_final.pdf |website=Nvidia |access-date=March 22, 2024}}</ref> |
|||
On the HPC models, the GK110/210, the SMX count was raised to 13-15 depending on the product, and more FP64 cores were included to bring the compute ratio up to 1/3 of FP32. On the GK110, the per-thread register limit was quadrupled over Fermi to 255, but this still only allows a thread using half of the registers to parallelize to 1/4 of each SMX. The GK210 (released at the same time) increased the register limit to 512 to improve performance in high register pressure situations like this. Texture cache, which programmers had already been using for compute as a read-only buffer in previous generations, was increased in size and the data path optimized for faster throughput when using this method. All levels of memory including the register file are single-bit ECC as well. |
|||
The Kepler architecture employs a new streaming multiprocessor architecture called SMX. The SMX is key to Kepler's power efficiency, as the whole GPU uses a single "core clock" rather than the double-pumped "shader clock".<ref name=anandtech-GTX680-review /> Although the unified clock increases power efficiency (two Kepler CUDA cores consume about 90% of the power of one Fermi CUDA core), the SMX needs additional processing units to execute a whole warp per cycle. Kepler also needed to increase raw GPU performance to remain competitive. As a result, the CUDA cores per array were doubled from 16 to 32, the CUDA core arrays from 3 to 6, and the load/store and SFU groups from 1 each to 2 each. The GPU's processing resources were also doubled: from 2 warp schedulers to 4, from 4 dispatch units to 8, and the register file to 64K entries. Because doubling these units and resources consumed more die space, the capability of the PolyMorph Engine was not doubled but enhanced, making it capable of producing a polygon in 2 cycles instead of 4.<ref>{{cite web | url=http://www.tomshardware.com/reviews/geforce-gtx-680-review-benchmark,3161-2.html| title=GK104: The Chip And Architecture | date=March 22, 2012 | publisher=Tom's Hardware}}</ref> With Kepler, Nvidia had to work not only on power efficiency but also on area efficiency. Nvidia therefore opted to use 8 dedicated FP64 CUDA cores per SMX, saving die space while still offering FP64 capability, since the regular Kepler CUDA cores are not FP64-capable. The result of these changes was an increase in graphics performance at the expense of FP64 performance. |
|||
Another notable feature is that while Fermi GPUs could only be accessed by one CPU thread at a time, the HPC Kepler GPUs added multithreading support so high core count processors could open 32 connections and more easily saturate the compute capability.<ref>{{cite web |title=Nvidia Kepler GK210/110 Architecture White Paper |url=https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf |website=Nvidia |language=en-US |access-date=22 March 2024}}</ref> |
|||
=== A New Instruction Scheduler === |
|||
Additional die area was freed by replacing the complex hardware scheduler with a simple software scheduler. With software scheduling, warp scheduling was moved to Nvidia's compiler, and because the GPU math pipeline now has a fixed latency, the design exploits instruction-level parallelism and superscalar execution in addition to thread-level parallelism. As instructions are statically scheduled, scheduling inside a warp becomes redundant, since the latency of the math pipeline is already known. This resulted in savings in die area and improved power efficiency.<ref name=anandtech-GTX680-review /> <ref>{{cite web | url=http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf| title=NVIDIA Kepler GK110 Architecture Whitepaper}}</ref> <ref name=gtx680-nvidia-paper /> |
|||
=== Simplified Instruction Scheduler === |
|||
Additional die space reduction and power savings were achieved by removing a complex hardware block that handled the prevention of data hazards.<ref name=gtx680-nvidia-paper /><ref name=anandtech-GTX680-review /><ref name=anandtech-GK110-preview /><ref name="nvidia">{{cite web |title=Nvidia Kepler GK110 Architecture Whitepaper |url=https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf |website=Nvidia |language=en-US |access-date=2015-09-19}}</ref> |
|||
=== GPU Boost === |
||
GPU Boost is a new feature which is roughly analogous to turbo boosting of a CPU. The GPU is always guaranteed to run at a minimum clock speed, referred to as the "base clock". This clock speed is set to the level which will ensure that the GPU stays within [[Thermal design power|TDP]] specifications, even at maximum loads.<ref name=gtx680-nvidia-paper /> When loads are lower, however, there is room for the clock speed to be increased without exceeding the TDP. In these scenarios, GPU Boost will gradually increase the clock speed in steps, until the GPU reaches a predefined power target of 170W by default (on the 680 card).<ref name=anandtech-GTX680-review /> By taking this approach, the GPU will ramp its clock up or down dynamically, so that it is providing the maximum amount of speed possible while remaining within TDP specifications. |
||
The power target, as well as the size of the clock increase steps that the GPU will take, are both adjustable via third-party utilities and provide a means of overclocking Kepler-based cards.<ref name=gtx680-nvidia-paper /> |
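The stepping behaviour described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not Nvidia's actual algorithm: the function `boost_clock`, the linear power models, the 13 MHz step, the 1006 MHz base clock, and the 1300 MHz ceiling are all hypothetical values chosen for the example. Only the 170 W power target comes from the text.

```python
def boost_clock(base_mhz, step_mhz, power_target_w, power_at, max_mhz=1300):
    """Raise the clock in fixed steps while the next step stays within the power target.

    The base clock is always kept, even if the load already exceeds the target.
    power_at is a caller-supplied function mapping a clock (MHz) to power draw (W).
    """
    clock = base_mhz
    while clock + step_mhz <= max_mhz and power_at(clock + step_mhz) <= power_target_w:
        clock += step_mhz
    return clock


def light_load_power(clock_mhz):
    # Hypothetical linear power model for a light workload (illustration only).
    return 0.15 * clock_mhz


def heavy_load_power(clock_mhz):
    # Hypothetical linear power model for a heavy workload (illustration only).
    return 0.17 * clock_mhz


light = boost_clock(1006, 13, 170, light_load_power)
heavy = boost_clock(1006, 13, 170, heavy_load_power)
```

With these assumed numbers the light workload leaves headroom, so the clock climbs several steps above the base clock, while the heavy workload stays at the base clock. Raising `power_target_w` or `step_mhz` in this model corresponds to the overclocking adjustments mentioned above.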
||
=== Microsoft Direct3D Support === |
||
Nvidia Fermi and Kepler GPUs in the GeForce 600 series support the Direct3D 11.0 specification. Nvidia originally stated that the Kepler architecture has full [[DirectX]] 11.1 support, which includes the Direct3D 11.1 path.<ref>{{cite web |title=Nvidia Launches First GeForce GPUs Based on Next-Generation Kepler Architecture |url=http://nvidianews.nvidia.com/Releases/NVIDIA-Launches-First-GeForce-GPUs-Based-on-Next-Generation-Kepler-Architecture-79b.aspx |website=Nvidia |language=en-US |date=March 22, 2012 |url-status=dead |archive-url=https://web.archive.org/web/20130614205336/http://nvidianews.nvidia.com/Releases/NVIDIA-Launches-First-GeForce-GPUs-Based-on-Next-Generation-Kepler-Architecture-79b.aspx |archive-date=June 14, 2013 }}</ref> The following "Modern UI" Direct3D 11.1 features, however, are not supported:<ref>{{cite web |last=Edward |first=James |date=November 22, 2012 |title=Nvidia claims partially support DirectX 11.1 |url=http://technewspedia.com/nvidia-claims-partially-support-directx-11-1/ |website=TechNews |language=en-US |access-date=2015-09-19 |archive-url=https://web.archive.org/web/20150628213421/http://technewspedia.com/nvidia-claims-partially-support-directx-11-1/ |archive-date=June 28, 2015 | url-status=dead }}</ref><ref name="Nvidia/D3D11.1">{{cite web|url=http://www.brightsideofnews.com/news/2012/11/21/nvidia-doesnt-fully-support-directx-111-with-kepler-gpus2c-bute280a6.aspx |title=Nvidia Doesn't Fully Support DirectX 11.1 with Kepler GPUs, But… (Web Archive Link) |publisher=BSN |url-status=dead |archive-url=https://web.archive.org/web/20121229062851/http://www.brightsideofnews.com/news/2012/11/21/nvidia-doesnt-fully-support-directx-111-with-kepler-gpus2c-bute280a6.aspx |archive-date=December 29, 2012 }}</ref> |
||
* Target-Independent Rasterization (2D rendering only). |
||
Line 48: | Line 128: | ||
* UAV (Unordered Access View) in non-pixel-shader stages. |
||
According to the definition by Microsoft, [[Direct3D feature level]] 11_1 must be complete, otherwise the Direct3D 11.1 path cannot be executed.<ref>{{cite web | url=https://msdn.microsoft.com/en-us/library/windows/desktop/ff476329%28v=vs.85%29.aspx | title=D3D_FEATURE_LEVEL enumeration (Windows) | publisher=MSDN |access-date=2015-09-19}}</ref>
||
The integrated Direct3D features of the Kepler architecture are the same as those of the GeForce 400 series Fermi architecture.<ref name="Nvidia/D3D11.1"/> |
||
=== Next Microsoft Direct3D Support === |
||
Nvidia Kepler GPUs of the GeForce 600/700 series support Direct3D 12 feature level 11_0.<ref>{{cite web |last=Moreton |first=Henry |date=March 20, 2014 |title=DirectX 12: A Major Stride for Gaming |url=http://blogs.nvidia.com/blog/2014/03/20/directx-12/ |website=Nvidia |language=en-US |access-date=2015-09-19}}</ref> |
|||
|||
=== TXAA Support === |
||
Exclusive to Kepler GPUs, TXAA is a new anti-aliasing method from Nvidia that is designed for direct implementation into game engines. TXAA is based on the [[Multisample anti-aliasing|MSAA]] technique and custom resolve filters. It is designed to address a key problem in games known as shimmering or [[temporal aliasing]]. TXAA resolves that by smoothing out the scene in motion, making sure that any in-game scene is being cleared of any aliasing and shimmering.<ref name="gtx680-nvidia-paper"/> |
|||
|||
=== Shuffle Instructions === |
|||
|||
The GK110 had a small number of instructions added to further improve performance. New shuffle instructions allow threads within a warp to share data among themselves directly, completing in a single instruction the store and load operations that previously required two accesses to local memory; this makes the process around 6% faster than using local data storage. Atomic operations were also improved, with 9x increases in speed for some instructions and the addition of more atomic 64-bit operations, namely min, max, and, or, and xor.<ref name="anandtech-GK110-preview">{{cite web |last=Smith |first=Ryan |date=November 12, 2012 |title=Nvidia Launches Tesla K20 & K20X: GK110 Arrives At Last |url=http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last/3 |website=AnandTech |language=en-US |access-date=September 19, 2015}}</ref> |
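The data exchange that shuffle instructions perform can be simulated on the CPU. The sketch below is a Python model of a "shuffle down" operation across the 32 lanes of a warp, used here for a warp-wide sum; it is an illustration of the idea, not GPU code, and the helper names `shfl_down` and `warp_sum` are hypothetical.

```python
WARP_SIZE = 32  # number of threads (lanes) in a warp


def shfl_down(values, delta):
    """Model of a shuffle-down: lane i reads the value held by lane i + delta.

    Lanes whose source would fall past the end of the warp keep their own value.
    """
    return [
        values[lane + delta] if lane + delta < WARP_SIZE else values[lane]
        for lane in range(WARP_SIZE)
    ]


def warp_sum(values):
    """Tree reduction across a warp using only lane-to-lane exchanges."""
    if len(values) != WARP_SIZE:
        raise ValueError("expected one value per lane")
    lanes = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        received = shfl_down(lanes, offset)
        lanes = [own + other for own, other in zip(lanes, received)]
        offset //= 2
    return lanes[0]  # lane 0 ends up holding the sum of all 32 inputs


total = warp_sum(range(WARP_SIZE))
```

Each of the five exchange rounds halves the distance between communicating lanes. In this model no intermediate buffer is written and read back, which is the saving the paragraph above attributes to shuffle instructions.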
|||
=== Hyper-Q === |
||
Hyper-Q expands the GK110 hardware work queues from 1 to 32. With a single work queue, Fermi could be under-occupied at times, as there was not always enough work in that queue to fill every SM. With 32 work queues, GK110 can in many scenarios achieve higher utilization by placing different task streams on what would otherwise be an idle SMX. The simple nature of Hyper-Q is further reinforced by the fact that it maps easily to MPI, a common message passing interface frequently used in HPC. Legacy MPI-based algorithms that were originally designed for multi-CPU systems, and that became bottlenecked by false dependencies, now have a solution: by increasing the number of MPI jobs, it is possible to use Hyper-Q to improve efficiency on these algorithms without changing the code itself.<ref name=anandtech-GK110-preview /> |
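The effect of false dependencies can be shown with a deliberately simplified model. The sketch below assumes that work placed in one hardware queue runs strictly in order, that separate queues run concurrently, and that enough idle multiprocessors exist to serve every queue. Real hardware behaves in a more nuanced way, and the function `makespan` and the example durations are hypothetical.

```python
def makespan(streams, hardware_queues):
    """Total time to finish all streams when they share a number of hardware queues.

    streams is a list of independent task streams, each a list of task durations.
    Streams are assigned to queues round-robin. Streams sharing a queue wait on
    each other even though they are logically independent (a false dependency).
    """
    queue_time = [0] * hardware_queues
    for index, stream in enumerate(streams):
        queue_time[index % hardware_queues] += sum(stream)
    return max(queue_time)


# Four independent streams with hypothetical task durations.
streams = [[2, 1], [3], [1, 1, 1], [4]]

single_queue = makespan(streams, 1)    # one work queue, as described for Fermi
hyper_q = makespan(streams, 32)        # 32 work queues, as described for GK110
```

Under these assumptions the single queue finishes only after every stream has run back to back, while 32 queues finish as soon as the longest individual stream completes.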
|||
=== Dynamic Parallelism === |
|||
With GK110, Nvidia focused on compute performance. With 7.1 billion transistors, it was the largest GPU by transistor count at the time of its release, dwarfing the GK104 and GF110. GK110 is unrivaled from a fabrication and power consumption standpoint, but the end result is unmatched performance per watt, because so many tasks (graphical and compute) are massively parallel and map well to the large arrays of streaming processors found in GK110. |
|||
Dynamic Parallelism is the ability for kernels to dispatch other kernels. With Fermi, only the CPU could dispatch a kernel, which incurs a certain amount of overhead by having to communicate back to the CPU. By giving kernels the ability to dispatch their own child kernels, GK110 can both save time by not having to go back to the CPU, and in the process free up the CPU to work on other tasks.<ref name=anandtech-GK110-preview /> |
|||
=== Grid Management Unit === |
|||
With GK110, both the register file and the L2 cache gained space and bandwidth. At the SMX level, the GK110 register file grew to 256 KB, composed of 65K 32-bit registers, larger than Fermi's. The GK110 L2 cache grew to as much as 1.5 MB, twice the size of GF110's. The bandwidth of both the L2 cache and the register file also doubled. |
|||
Enabling Dynamic Parallelism requires a new grid management and dispatch control system. The new Grid Management Unit (GMU) manages and prioritizes grids to be executed. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The CUDA Work Distributor in Kepler holds grids that are ready to dispatch, and is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD. The Kepler CWD communicates with the GMU via a bidirectional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed.<ref name="nvidia" /> |
|||
Performance in register-starved scenarios is also improved, as more registers are available to each thread. This goes hand in hand with an increase in the total number of registers each thread can address, from 63 registers per thread to 255 registers per thread with GK110. |
|||
=== Nvidia GPUDirect === |
|||
With GK110, Nvidia also reworked the GPU texture cache to be used for compute. At 48 KB in size, the texture cache becomes a read-only cache in compute, specializing in unaligned memory access workloads. Furthermore, error detection capabilities have been added to make it safer for use with workloads that rely on ECC.<ref name="anandtech-GK110-preview">{{cite web | url=http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last/3| title=NVIDIA Launches Tesla K20 & K20X: GK110 Arrives At Last | date=November 12, 2012 | publisher=AnandTech}}</ref> |
|||
Nvidia GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory.<ref>{{Cite news |title=Nvidia GPUDirect |url=https://developer.nvidia.com/gpudirect |website=Nvidia Developer |language=en-US |date=October 6, 2015 |access-date=February 5, 2019}}</ref> It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. The Kepler GK110 die also supports other GPUDirect features including Peer‐to‐Peer and GPUDirect for Video. |
|||
=== Video decompression/compression === |
|||
== Features == |
|||
==== NVDEC ==== |
|||
The GeForce 700 Series contains features from both GK104 and GK110. Kepler based members of the 700 series add the following standard features to the GeForce family. |
|||
{{Main|Nvidia NVDEC}} |
|||
==== NVENC ==== |
|||
Derived from GK104: |
|||
{{Main|Nvidia NVENC}} |
|||
NVENC is Nvidia's power-efficient fixed-function encoder, which is able to decode, preprocess, and encode H.264-based content. Its output format is limited to H.264, but within that limit NVENC can support encodes of up to 4096x4096.<ref name="Tom’s Hardware">{{cite web |last=Angelini |first=Chris |date=March 22, 2012 |title=Benchmark Results: NVEnc And MediaEspresso 6.5 |url=http://www.tomshardware.com/reviews/geforce-gtx-680-review-benchmark,3161-16.html |website=Tom’s Hardware |language=en-US |access-date=September 19, 2015}}</ref> |
|||
Like Intel's QuickSync, NVENC is currently exposed through a proprietary API, though Nvidia does have plans to provide NVENC usage through CUDA.<ref name="Tom’s Hardware"/> |
|||
* [[PCI_Express#PCI_Express_3.0|PCI Express 3.0]] interface |
|||
== Performance == |
|||
* [[DisplayPort]] 1.2 |
|||
The theoretical single-precision processing power of a Kepler GPU in [[GFLOPS]] is computed as 2 (operations per FMA instruction per CUDA core per cycle) × number of CUDA cores × core clock speed (in GHz). Note that like the previous generation [[Fermi (microarchitecture)#Performance|Fermi]], Kepler is not able to benefit from increased processing power by dual-issuing MAD+MUL like [[Tesla (microarchitecture)#Performance|Tesla]] was capable of. |
|||
* [[HDMI]] 1.4a 4K x 2K video output |
|||
* [[Purevideo#The_Fifth_Generation_PureVideo_HD|Purevideo VP5]] hardware video acceleration (up to 4K x 2K H.264 decode) |
|||
* Hardware H.264 encoding acceleration block (NVENC) |
|||
* Support for up to 4 independent 2D displays, or 3 stereoscopic/3D displays (NV Surround) |
|||
* Bindless Textures |
|||
* GPU Boost |
|||
* TXAA |
|||
* Manufactured by [[TSMC]] on a 28 nm process |
|||
The theoretical double-precision processing power of a Kepler GK110/210 GPU is 1/3 of its single precision performance. This double-precision processing power is however only available on professional [[Nvidia Quadro|Quadro]], [[Nvidia Tesla|Tesla]], and high-end Titan-branded [[GeForce]] cards, while drivers for consumer GeForce cards limit the performance to 1/24 of the single precision performance.<ref>{{cite news |last=Angelini |first=Chris |date=November 7, 2013 |title=Nvidia GeForce GTX 780 Ti Review: GK110, Fully Unlocked |url=http://www.tomshardware.com/reviews/geforce-gtx-780-ti-review-benchmarks,3663.html |website=Tom's Hardware |access-date=December 6, 2015 |page=1 |quote=The card's driver deliberately operates GK110’s FP64 units at 1/8 of the GPU’s clock rate. When you multiply that by the 3:1 ratio of single- to double-precision CUDA cores, you get a 1/24 rate}}</ref> The lower performance GK10x dies are similarly capped to 1/24 of the single precision performance.<ref>{{cite news |last=Smith |first=Ryan |date=13 September 2012 |title=The Nvidia GeForce GTX 660 Review: GK106 Fills Out The Kepler Family |url=http://www.anandtech.com/show/6276/nvidia-geforce-gtx-660-review-gk106-rounds-out-the-kepler-family |website=AnandTech |language=en-US |access-date=6 December 2015 |page=1}}</ref> |
|||
New features from GK110: |
|||
== Kepler dies == |
|||
* Compute Focus SMX Improvement |
|||
'''Kepler''' |
|||
* [[CUDA]] Compute Capability 3.5 |
|||
{| class="wikitable" style="text-align:center; height:3em; white-space:nowrap;" |
|||
* New Shuffle Instructions |
|||
! colspan="2" | |
|||
* Dynamic Parallelism |
|||
! style="width:12em; height:3em;" | GK104 |
|||
* Hyper-Q (Hyper-Q's MPI functionality reserved for Tesla only) |
|||
! style="width:12em; height:3em;" | GK106 |
|||
* Grid Management Unit |
|||
! style="width:12em; height:3em;" | GK107 |
|||
* NVIDIA GPUDirect (GPUDirect's RDMA functionality reserved for Tesla only) |
|||
! style="width:12em; height:3em;" | GK110 |
|||
|- |
|||
! colspan="2" style="text-align: left;" | Variant(s) |
|||
=== Compute Focus SMX Improvement === |
|||
| style="vertical-align:top;" | GK104-200-A2 <br /> GK104-300-A2 <br /> GK104-325-A2 <br /> GK104-400-A2 <br /> GK104-425-A2 <br /> GK104-850-A2 |
|||
| style="vertical-align:top;" | GK106-240-A1 <br /> GK107-400-A1 |
|||
| style="vertical-align:top;" | GK107-300-A2 <br /> GK107-301-A2 <br /> GK107-320-A2 <br /> GK107-400-A2 <br /> GK107-425-A2 <br /> GK107-450-A2 <br /> GK107-810-A2 |
|||
| style="vertical-align:top;" | GK110-300-A1 <br /> GK110-400-A1<br /> GK110-425-B1 <br /> GK110-885-A1 |
|||
|- |
|||
! colspan="2" style="text-align: left;" | Release date |
|||
With GK110, Nvidia opted to increase compute performance. The single biggest change from GK104 is that rather than 8 dedicated FP64 CUDA cores, a GK110 SMX has up to 64, giving it 8x the FP64 throughput of a GK104 SMX. The SMX also gains register file space, which has increased to 256 KB compared to Fermi. The texture cache is also improved: with 48 KB of space, it can become a read-only cache for compute workloads.<ref name=anandtech-GK110-preview /> |
|||
| {{dts|2012|April|03|format=mdy|abbr=on}} |
|||
| {{dts|2012|September|06|format=mdy|abbr=on}} |
|||
| {{dts|2012|September|06|format=mdy|abbr=on}} |
|||
| {{dts|2012|November|12|format=mdy|abbr=on}} |
|||
|- |
|||
! rowspan="3" | Cores |
|||
=== New Shuffle Instructions === |
|||
! style="text-align: left;" | [[Unified shader model|CUDA Cores]] |
|||
At a low level, GK110 gains additional instructions and operations to further improve performance. New shuffle instructions allow threads within a warp to share data without going back to memory, making the process much quicker than the previous load/share/store method. Atomic operations are also overhauled, speeding up their execution and adding some FP64 operations that were previously only available for FP32 data.<ref name=anandtech-GK110-preview /> |
|||
| 1536 |
|||
| 960 |
|||
| 384 |
|||
| 2880 |
|||
|- |
|||
! style="text-align: left;" | [[Texture mapping unit|TMUs]] |
|||
| 128 |
|||
| 80 |
|||
| 32 |
|||
| 240 |
|||
|- |
|||
! style="text-align: left;" | [[Render output unit|ROPs]] |
|||
| 32 |
|||
| 24 |
|||
| 16 |
|||
| 48 |
|||
|- |
|||
! colspan="2" style="text-align: left;" | Streaming Multiprocessors |
|||
=== Hyper-Q === |
|||
| 8 |
|||
Hyper-Q expands GK110 hardware work queues from 1 to 32. This matters because a single work queue meant that Fermi could be under-occupied at times, as there was not enough work in that queue to fill every SM. With 32 work queues, GK110 can in many scenarios achieve higher utilization by placing different task streams on what would otherwise be an idle SMX. Hyper-Q also maps easily to MPI, a common message passing interface frequently used in HPC. Legacy MPI-based algorithms that were originally designed for multi-CPU systems, and that became bottlenecked by false dependencies, now have a solution: by increasing the number of MPI jobs, it is possible to use Hyper-Q on these algorithms to improve efficiency without changing the code itself.<ref name=anandtech-GK110-preview /> |
|||
| 5 |
|||
| 2 |
|||
| 15 |
|||
|- |
|||
! colspan="2" style="text-align: left;" | {{abbr|GPCs|Graphics Processing Clusters}} |
|||
=== Dynamic Parallelism === |
|||
| 4 |
|||
Dynamic Parallelism is the ability of kernels to dispatch other kernels. With Fermi, only the CPU could dispatch a kernel, which incurs a certain amount of overhead from having to communicate back to the CPU. By giving kernels the ability to dispatch their own child kernels, GK110 can both save time by not having to go back to the CPU and, in the process, free up the CPU to work on other tasks.<ref name=anandtech-GK110-preview /> |
|||
| 3 |
|||
| 1 |
|||
| 5 |
|||
|- |
|||
! rowspan="2" | Cache |
|||
=== Grid Management Unit === |
|||
! style="text-align: left;" | L1 |
|||
Enabling Dynamic Parallelism requires a new grid management and dispatch control system. The new Grid Management Unit (GMU) manages and prioritizes grids to be executed. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. |
|||
| 128{{nbsp}}<small>KB</small> |
|||
The CUDA Work Distributor in Kepler holds grids that are ready to dispatch, and is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD. The Kepler CWD communicates with the GMU via a bidirectional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed. <ref>{{cite web | url= http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf | title= NVIDIA-Kepler-GK110-Architecture-Whitepaper|}}</ref> |
|||
| 80{{nbsp}}<small>KB</small> |
|||
| 32{{nbsp}}<small>KB</small> |
|||
| 240{{nbsp}}<small>KB</small> |
|||
|- |
|||
! style="text-align: left;" | L2 |
|||
| 512{{nbsp}}<small>KB</small> |
|||
| 512{{nbsp}}<small>KB</small> |
|||
| 256{{nbsp}}<small>KB</small> |
|||
| 1.5{{nbsp}}<small>MB</small> |
|||
|- |
|||
! colspan="2" style="text-align: left;" | Memory interface |
|||
| 256-bit |
|||
| 192-bit |
|||
| 192-bit |
|||
| 384-bit |
|||
|- |
|||
! colspan="2" style="text-align: left;" | Die size |
|||
| 294{{nbsp}}<small>mm<sup>2</sup></small> |
|||
| 221{{nbsp}}<small>mm<sup>2</sup></small> |
|||
| 118{{nbsp}}<small>mm<sup>2</sup></small> |
|||
| 561{{nbsp}}<small>mm<sup>2</sup></small> |
|||
|- |
|||
! colspan="2" style="text-align: left;" | Transistor count |
|||
| 3.54{{nbsp}}<small>bn.</small> |
|||
| 2.54{{nbsp}}<small>bn.</small> |
|||
| 1.27{{nbsp}}<small>bn.</small> |
|||
| 7.08{{nbsp}}<small>bn.</small> |
|||
|- |
|||
! colspan="2" style="text-align: left;" | Transistor density |
|||
| 12.0{{nbsp}}<small>MTr/mm<sup>2</sup></small> |
|||
| 11.5{{nbsp}}<small>MTr/mm<sup>2</sup></small> |
|||
| 10.8{{nbsp}}<small>MTr/mm<sup>2</sup></small> |
|||
| 12.6{{nbsp}}<small>MTr/mm<sup>2</sup></small> |
|||
|- |
|||
! colspan="2" style="text-align: left;" | Package socket |
|||
| BGA{{nbsp}}1745 |
|||
| BGA{{nbsp}}1425 |
|||
| BGA{{nbsp}}908 |
|||
| BGA{{nbsp}}2152 |
|||
|- |
|||
! colspan="7" | Products |
|||
|- |
|||
! rowspan="2" | Consumer |
|||
! style="text-align: left;" | Desktop |
|||
| style="vertical-align:top;" | GTX 660 <br /> GTX 660 Ti <br /> GTX 670 <br /> GTX 680 <br /> GTX 690 <br /> GTX 760 <br /> GTX 760 Ti <br /> GTX 770 |
|||
| style="vertical-align:top;" | GTX 650 <br /> GTX 650 Ti <br /> GTX 660 |
|||
| style="vertical-align:top;" | GT 630 <br /> GTX 650 <br /> GT 720 <br /> GT 730 <br /> GT 740 <br /> GT 1030 |
|||
| style="vertical-align:top;" | GTX 780 <br /> GTX Titan |
|||
|- |
|||
! style="text-align: left;" | Mobile |
|||
| style="vertical-align:top;" | GTX 670MX <br /> GTX 675MX <br /> GTX 680M <br /> GTX 680MX <br /> GTX 775M <br /> GTX 780M <br /> GTX 860M <br /> GTX 870M <br /> GTX 880M |
|||
| style="vertical-align:top;" | GTX 765M <br /> GTX 770M |
|||
| style="vertical-align:top;" | GT 640M <br /> GTX 640M LE <br /> GT 645M <br /> GT 650M <br /> GTX 660M <br /> GT 740M <br /> GT 745M <br /> GT 750M <br /> GT 755M <br /> GTX 810M <br /> GTX 820M |
|||
| style="vertical-align:top;" {{N/A}} |
|||
|- |
|||
! rowspan="2" | Workstation |
|||
! style="text-align: left;" | Desktop |
|||
| style="vertical-align:top;" | Quadro K4200 <br /> Quadro K5000 |
|||
| style="vertical-align:top;" | Quadro K4000 <br /> Quadro K5000 |
|||
| style="vertical-align:top;" | Quadro K410 <br /> Quadro K420 <br /> Quadro K600 <br /> Quadro K2000 <br /> Quadro K2000D |
|||
| style="vertical-align:top;" | Quadro K5200 <br /> Quadro K6000 |
|||
|- |
|||
! style="text-align: left;" | Mobile |
|||
| style="vertical-align:top;" | Quadro K3000M <br /> Quadro K3100M <br /> Quadro K4000M <br /> Quadro K4100M <br /> Quadro K5000M <br /> Quadro K5100M |
|||
| style="vertical-align:top;" {{N/A}} |
|||
| style="vertical-align:top;" | Quadro K100M <br /> Quadro K200M <br /> Quadro K500M <br /> Quadro K1000M <br /> Quadro K1100M <br /> Quadro K2000M |
|||
| style="vertical-align:top;" {{N/A}} |
|||
|- |
|||
|} |
|||
'''Kepler 2.0''' |
|||
* GK208 |
|||
* GK210 |
|||
* GK20A ([[Tegra#Tegra K1|Tegra K1]]) |
|||
== See also == |
|||
* [[List of eponyms of Nvidia GPU microarchitectures]] |
|||
* [[List of Nvidia graphics processing units]] |
|||
* [[Nvidia NVDEC]] |
|||
== References == |
|||
{{reflist|30em}} |
|||
{{Nvidia}} |
|||
[[Category:Nvidia microarchitectures]] |
|||
[[Category:Graphics microarchitectures|Nvidia Kepler]]
Latest revision as of 04:57, 27 November 2024
Launched | April 3, 2012 |
---|---|
Designed by | Nvidia |
Manufactured by | |
Fabrication process | TSMC 28 nm |
Product Series | |
Desktop | |
Professional/workstation | |
Server/datacenter | |
Specifications | |
L1 cache | 16 KB (per SM) |
L2 cache | Up to 512 KB |
Memory support | GDDR5 |
PCIe support | PCIe 2.0 PCIe 3.0 |
Supported Graphics APIs | |
DirectX | DirectX 12 Ultimate (Feature Level 11_0) |
Shader Model | Shader Model 6.5 |
Vulkan | Vulkan 1.2 |
Media Engine | |
Encode codecs | H.264 |
Decode codecs | |
Encoder(s) supported | NVENC |
Display outputs | DVI DisplayPort 1.2 HDMI 1.4a |
History | |
Predecessor | Fermi |
Successor | Maxwell |
Support status | |
Unsupported |
Kepler is the codename for a GPU microarchitecture developed by Nvidia, first introduced at retail in April 2012,[1] as the successor to the Fermi microarchitecture. Kepler was Nvidia's first microarchitecture to focus on energy efficiency. Most GeForce 600 series, most GeForce 700 series, and some GeForce 800M series GPUs were based on Kepler, all manufactured in 28 nm. Kepler found use in the GK20A, the GPU component of the Tegra K1 SoC, and in the Quadro Kxxx series, the Quadro NVS 510, and Tesla computing modules.
Kepler was followed by the Maxwell microarchitecture and used alongside Maxwell in the GeForce 700 series and GeForce 800M series.
The architecture is named after Johannes Kepler, a German mathematician and key figure in the 17th century scientific revolution.
Overview
Nvidia's previous architecture, Fermi, was designed to increase performance in compute and tessellation. With the Kepler architecture, Nvidia shifted its focus to efficiency, programmability, and performance.[2][3] The efficiency aim was achieved through a unified GPU clock, simplified static scheduling of instructions, and a greater emphasis on performance per watt.[4] Abandoning the shader clock found in previous GPU designs increases efficiency, even though more cores are needed to reach higher levels of performance. This is not only because the cores are more power-efficient (two Kepler cores use about 90% of the power of one Fermi core, according to Nvidia's numbers), but also because the unified GPU clock scheme delivers a 50% reduction in power consumption in that area.[5]
The programmability aim was achieved with Kepler's Hyper-Q, Dynamic Parallelism, and new Compute Capability 3.x functionality. These features made higher GPU utilization and simpler code management achievable on GK GPUs, allowing more flexibility in programming for Kepler.[6]
Finally, the performance aim was met with additional execution resources (more CUDA cores, registers, and cache) and a memory clock speed of up to 7 GHz, both of which increase Kepler's performance compared to previous Nvidia GPUs.[5][7]
Features
The GK series GPUs contain features from both the older Fermi and newer Kepler generations. Kepler-based members add the following standard features:
- PCI Express 3.0 interface
- DisplayPort 1.2
- HDMI 1.4a 4K x 2K video output
- PureVideo VP5 hardware video acceleration (up to 4K x 2K H.264 decode)
- Hardware H.265 decoding[8]
- Hardware H.264 encoding acceleration block (NVENC)
- Support for up to 4 independent 2D displays, or 3 stereoscopic/3D displays (NV Surround)
- Next Generation Streaming Multiprocessor (SMX)
- Polymorph-Engine 2.0
- Simplified Instruction Scheduler
- Bindless Textures
- CUDA Compute Capability 3.0 to 3.5
- GPU Boost (Upgraded to 2.0 on GK110)
- TXAA Support
- Manufactured by TSMC on a 28 nm process
- New Shuffle Instructions
- Dynamic Parallelism
- Hyper-Q (Hyper-Q's MPI functionality reserved for Tesla only)
- Grid Management Unit
- Nvidia GPUDirect (GPUDirect's RDMA functionality reserved for Tesla only)
Next Generation Streaming Multiprocessor (SMX)
Kepler employs a new streaming multiprocessor architecture called SMX. CUDA execution core counts were increased from 32 per each of 16 SMs to 192 per each of 8 SMX units, while the register file was only doubled per SMX to 65,536 x 32-bit, giving a lower overall ratio of registers to cores. Because of this and other compromises, the actual performance gains in most operations were well under 3x, despite the 3x overall increase in CUDA cores and the clock increase (on the 680 vs. the Fermi 580). Dedicated FP64 CUDA cores are used rather than treating two FP32 cores as a single unit as was done previously, and very few were included on the consumer models, resulting in FP64 calculation at 1/24 the speed of FP32.[9]
On the HPC models, the GK110/210, the SMX count was raised to 13-15 depending on the product, and more FP64 cores were included to bring the FP64 compute ratio up to 1/3 of FP32. On the GK110, the per-thread register limit was quadrupled over Fermi to 255, but this still only allows threads that each use half of that limit to fill 1/4 of each SMX's thread capacity. The GK210 (released at the same time) increased the register limit to 512 to improve performance in high register pressure situations like this. The texture cache, which programmers had already been using for compute as a read-only buffer in previous generations, was increased in size, and its data path was optimized for faster throughput when used this way. All levels of memory, including the register file, are protected by single-bit ECC as well.
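The register-pressure remark above can be checked with a little arithmetic. The following is a minimal sketch, not Nvidia's occupancy calculator: it assumes registers are the only limiting resource, ignores allocation granularity and warp rounding, and takes the limit of 2,048 resident threads per SMX (a figure for compute capability 3.x that does not appear in this article) as an assumption. The function names are illustrative.

```python
REGISTERS_PER_SMX = 65_536     # 32-bit registers per SMX, as stated above
MAX_THREADS_PER_SMX = 2_048    # assumption: resident-thread limit for compute capability 3.x


def resident_threads(registers_per_thread):
    """Threads that fit on one SMX when only the register file limits residency."""
    by_registers = REGISTERS_PER_SMX // registers_per_thread
    return min(by_registers, MAX_THREADS_PER_SMX)


def occupancy(registers_per_thread):
    """Fraction of the SMX thread capacity that can be resident at once."""
    return resident_threads(registers_per_thread) / MAX_THREADS_PER_SMX


if __name__ == "__main__":
    for regs in (32, 128, 255):
        print(regs, resident_threads(regs), occupancy(regs))
```

Under these assumptions, a kernel using 128 registers per thread (about half of the 255 limit) keeps 512 threads resident, which is a quarter of the SMX's thread capacity.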
Another notable feature is that while Fermi GPUs could only be accessed by one CPU thread at a time, the HPC Kepler GPUs added multithreading support so high core count processors could open 32 connections and more easily saturate the compute capability.[10]
Simplified Instruction Scheduler
Additional die space and power savings were achieved by removing a complex hardware block that handled the prevention of data hazards.[3][5][11][12]
GPU Boost
GPU Boost is a new feature which is roughly analogous to turbo boosting of a CPU. The GPU is always guaranteed to run at a minimum clock speed, referred to as the "base clock". This clock speed is set to the level which will ensure that the GPU stays within TDP specifications, even at maximum loads.[3] When loads are lower, however, there is room for the clock speed to be increased without exceeding the TDP. In these scenarios, GPU Boost will gradually increase the clock speed in steps, until the GPU reaches a predefined power target, which is 170 W by default on the 680 card.[5] By taking this approach, the GPU ramps its clock up or down dynamically, so that it provides the highest possible speed while remaining within TDP specifications.
The power target, as well as the size of the clock increase steps that the GPU will take, are both adjustable via third-party utilities and provide a means of overclocking Kepler-based cards.[3]
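The stepping behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the real GPU Boost algorithm is not described in this article, and the linear power model, the 13 MHz step, the 1110 MHz ceiling, and the function names below are all hypothetical values chosen for illustration. Only the 170 W power target comes from the text.

```python
def boosted_clock_mhz(base_mhz, step_mhz, max_mhz, power_target_w, estimate_power_w):
    """Raise the clock in fixed steps while the estimated power stays within the target.

    The clock never drops below base_mhz, mirroring the guaranteed base clock.
    """
    clock = base_mhz
    while (clock + step_mhz <= max_mhz
           and estimate_power_w(clock + step_mhz) <= power_target_w):
        clock += step_mhz
    return clock


def linear_power_model(load):
    """Hypothetical power model: 0.16 W per MHz at full load, scaled by load."""
    return lambda clock_mhz: load * 0.16 * clock_mhz


if __name__ == "__main__":
    for load in (1.2, 1.0, 0.8):
        clock = boosted_clock_mhz(1006, 13, 1110, 170, linear_power_model(load))
        print(f"load {load}: {clock} MHz")
```

In this model, a lighter load leaves more power headroom and so lets the clock climb further, while a load that already exceeds the target keeps the GPU at its base clock. Raising `power_target_w` or `step_mhz` corresponds to the overclocking adjustments mentioned above.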
Microsoft Direct3D Support
Nvidia Fermi and Kepler GPUs in the GeForce 600 series support the Direct3D 11.0 specification. Nvidia originally stated that the Kepler architecture has full DirectX 11.1 support, which includes the Direct3D 11.1 path.[13] The following "Modern UI" Direct3D 11.1 features, however, are not supported:[14][15]
- Target-Independent Rasterization (2D rendering only).
- 16xMSAA Rasterization (2D rendering only).
- Orthogonal Line Rendering Mode.
- UAV (Unordered Access View) in non-pixel-shader stages.
According to Microsoft's definition, Direct3D feature level 11_1 must be complete; otherwise, the Direct3D 11.1 path cannot be executed.[16] The integrated Direct3D features of the Kepler architecture are the same as those of the GeForce 400 series Fermi architecture.[15]
Next Microsoft Direct3D Support
Nvidia Kepler GPUs of the GeForce 600/700 series support Direct3D 12 feature level 11_0.[17]
TXAA Support
Exclusive to Kepler GPUs, TXAA is a new anti-aliasing method from Nvidia that is designed for direct implementation into game engines. TXAA is based on the MSAA technique and custom resolve filters. It is designed to address a key problem in games known as shimmering or temporal aliasing. TXAA addresses this by smoothing out the scene in motion, reducing aliasing and shimmering in any in-game scene.[3]
Shuffle Instructions
The GK110 had a small number of instructions added to further improve performance. New shuffle instructions allow threads within a warp to share data directly: a single instruction performs the store and load that previously required two accesses to local memory, making the process around 6% faster than using local data storage. Atomic operations were also improved, with 9x increases in speed for some instructions and the addition of more atomic 64-bit operations, namely min, max, and, or, and xor.[11]
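To show what sharing data within a warp looks like, the sketch below models a 32-lane warp as a Python list and performs a butterfly sum reduction using an XOR-style shuffle. It is a software model of the data movement only, not GPU code: `shfl_xor` and `warp_reduce_sum` are illustrative names, and the model assumes each lane reads the register value of lane `i XOR mask` in one step.

```python
WARP_SIZE = 32  # threads (lanes) per warp


def shfl_xor(values, lane_mask):
    """Model of an XOR shuffle: lane i reads the value held by lane i ^ lane_mask."""
    return [values[lane ^ lane_mask] for lane in range(len(values))]


def warp_reduce_sum(values):
    """Butterfly reduction: after log2(32) = 5 shuffle steps every lane holds the sum."""
    assert len(values) == WARP_SIZE
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        partner = shfl_xor(vals, offset)           # exchange registers, no memory traffic
        vals = [a + b for a, b in zip(vals, partner)]
        offset //= 2
    return vals


if __name__ == "__main__":
    print(warp_reduce_sum(list(range(WARP_SIZE)))[0])
```

Each step exchanges values directly between lanes, which is the part that previously needed a store to and a load from local memory.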
Hyper-Q
Hyper-Q expands GK110 hardware work queues from 1 to 32. This matters because a single work queue meant that Fermi could be under-occupied at times, as there was not enough work in that queue to fill every SM. With 32 work queues, GK110 can in many scenarios achieve higher utilization by placing different task streams on what would otherwise be an idle SMX. Hyper-Q also maps easily to MPI, a common message passing interface frequently used in HPC. Legacy MPI-based algorithms that were originally designed for multi-CPU systems, and that became bottlenecked by false dependencies, now have a solution: by increasing the number of MPI jobs, it is possible to use Hyper-Q on these algorithms to improve efficiency without changing the code itself.[11]
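The effect of false dependencies can be sketched with a deliberately simplified scheduling model. It assumes that work placed in the same hardware queue runs strictly back to back, that streams are assigned to queues round-robin, and that there are always enough idle SMX units for separate queues to run concurrently. None of these details come from the article, and `makespan` is an illustrative name.

```python
def makespan(streams, hardware_queues):
    """Total time to finish all streams under the simplified queue model.

    streams: list of streams, each a list of kernel durations (arbitrary time units).
    hardware_queues: number of hardware work queues (1 on Fermi, 32 on GK110).
    """
    queue_time = [0] * hardware_queues
    for index, kernels in enumerate(streams):
        # Streams sharing a queue are serialized behind one another (false dependency).
        queue_time[index % hardware_queues] += sum(kernels)
    return max(queue_time)


if __name__ == "__main__":
    streams = [[3, 1], [2, 2], [4]]   # three independent task streams
    print("1 queue:  ", makespan(streams, 1))
    print("32 queues:", makespan(streams, 32))
```

With one queue the three independent streams finish only after the sum of all their kernels, while with 32 queues the finishing time is set by the longest single stream.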
Dynamic Parallelism
Dynamic Parallelism is the ability of kernels to dispatch other kernels. With Fermi, only the CPU could dispatch a kernel, which incurs a certain amount of overhead from having to communicate back to the CPU. By giving kernels the ability to dispatch their own child kernels, GK110 can both save time by not having to go back to the CPU and, in the process, free up the CPU to work on other tasks.[11]
Grid Management Unit
Enabling Dynamic Parallelism requires a new grid management and dispatch control system. The new Grid Management Unit (GMU) manages and prioritizes grids to be executed. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The CUDA Work Distributor (CWD) in Kepler holds grids that are ready to dispatch, and is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD. The Kepler CWD communicates with the GMU via a bidirectional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to the GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed.[12]
Nvidia GPUDirect
Nvidia GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory.[18] It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. The Kepler GK110 die also supports other GPUDirect features including Peer‐to‐Peer and GPUDirect for Video.
Video decompression/compression
NVDEC
NVENC
NVENC is Nvidia's power-efficient fixed-function encoder, which is able to take in, decode, preprocess, and encode H.264-based content. Its output format is limited to H.264, but within that limit NVENC can support encoding at up to 4096x4096.[19]
Like Intel's QuickSync, NVENC is currently exposed through a proprietary API, though Nvidia does have plans to provide NVENC usage through CUDA.[19]
Performance
The theoretical single-precision processing power of a Kepler GPU in GFLOPS is computed as 2 (operations per FMA instruction per CUDA core per cycle) × number of CUDA cores × core clock speed (in GHz). Like the previous-generation Fermi, Kepler cannot gain additional processing power by dual-issuing MAD+MUL as Tesla could.
The theoretical double-precision processing power of a Kepler GK110/210 GPU is 1/3 of its single precision performance. This double-precision processing power is however only available on professional Quadro, Tesla, and high-end Titan-branded GeForce cards, while drivers for consumer GeForce cards limit the performance to 1/24 of the single precision performance.[20] The lower performance GK10x dies are similarly capped to 1/24 of the single precision performance.[21]
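The two rules above translate directly into a short calculation. In this sketch the core counts come from the die table in this article, while the clock speeds (1.006 GHz and 0.875 GHz) are illustrative assumptions and not figures from the article. The function names are likewise hypothetical.

```python
def peak_sp_gflops(cuda_cores, core_clock_ghz):
    """Theoretical single-precision peak: 2 operations per FMA per core per cycle."""
    return 2 * cuda_cores * core_clock_ghz


def peak_dp_gflops(cuda_cores, core_clock_ghz, sp_to_dp_divisor):
    """Theoretical double-precision peak as a fraction (1/divisor) of the SP peak."""
    return peak_sp_gflops(cuda_cores, core_clock_ghz) / sp_to_dp_divisor


if __name__ == "__main__":
    # GK104 with 1536 CUDA cores at an assumed 1.006 GHz core clock.
    print(peak_sp_gflops(1536, 1.006))
    # GK110 with 2880 CUDA cores at an assumed 0.875 GHz core clock.
    print(peak_sp_gflops(2880, 0.875))
    print(peak_dp_gflops(2880, 0.875, 3))    # uncapped 1/3 rate
    print(peak_dp_gflops(2880, 0.875, 24))   # consumer cap of 1/24
```

The divisor makes the gap between the professional and consumer configurations explicit: at the same clock, the 1/24 cap yields one eighth of the uncapped double-precision figure.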
Kepler dies
Kepler
Specification | GK104 | GK106 | GK107 | GK110 |
---|---|---|---|---|
Variant(s) | GK104-200-A2, GK104-300-A2, GK104-325-A2, GK104-400-A2, GK104-425-A2, GK104-850-A2 | GK106-240-A1, GK107-400-A1 | GK107-300-A2, GK107-301-A2, GK107-320-A2, GK107-400-A2, GK107-425-A2, GK107-450-A2, GK107-810-A2 | GK110-300-A1, GK110-400-A1, GK110-425-B1, GK110-885-A1 |
Release date | Apr 3, 2012 | Sep 6, 2012 | Sep 6, 2012 | Nov 12, 2012 |
CUDA cores | 1536 | 960 | 384 | 2880 |
TMUs | 128 | 80 | 32 | 240 |
ROPs | 32 | 24 | 16 | 48 |
Streaming Multiprocessors | 8 | 5 | 2 | 15 |
GPCs | 4 | 3 | 1 | 5 |
L1 cache | 128 KB | 80 KB | 32 KB | 240 KB |
L2 cache | 512 KB | 512 KB | 256 KB | 1.5 MB |
Memory interface | 256-bit | 192-bit | 192-bit | 384-bit |
Die size | 294 mm² | 221 mm² | 118 mm² | 561 mm² |
Transistor count | 3.54 bn. | 2.54 bn. | 1.27 bn. | 7.08 bn. |
Transistor density | 12.0 MTr/mm² | 11.5 MTr/mm² | 10.8 MTr/mm² | 12.6 MTr/mm² |
Package socket | BGA 1745 | BGA 1425 | BGA 908 | BGA 2152 |
Products: consumer, desktop | GTX 660, GTX 660 Ti, GTX 670, GTX 680, GTX 690, GTX 760, GTX 760 Ti, GTX 770 | GTX 650, GTX 650 Ti, GTX 660 | GT 630, GTX 650, GT 720, GT 730, GT 740, GT 1030 | GTX 780, GTX Titan |
Products: consumer, mobile | GTX 670MX, GTX 675MX, GTX 680M, GTX 680MX, GTX 775M, GTX 780M, GTX 860M, GTX 870M, GTX 880M | GTX 765M, GTX 770M | GT 640M, GTX 640M LE, GT 645M, GT 650M, GTX 660M, GT 740M, GT 745M, GT 750M, GT 755M, GTX 810M, GTX 820M | N/A |
Products: workstation, desktop | Quadro K4200, Quadro K5000 | Quadro K4000, Quadro K5000 | Quadro K410, Quadro K420, Quadro K600, Quadro K2000, Quadro K2000D | Quadro K5200, Quadro K6000 |
Products: workstation, mobile | Quadro K3000M, Quadro K3100M, Quadro K4000M, Quadro K4100M, Quadro K5000M, Quadro K5100M | N/A | Quadro K100M, Quadro K200M, Quadro K500M, Quadro K1000M, Quadro K1100M, Quadro K2000M | N/A |
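Several rows of the table above are derived from others, which can be checked with a short script. This sketch assumes that transistor density is simply transistor count divided by die size, that every SMX contributes 192 CUDA cores (as stated in the SMX section), and that L1 cache is 16 KB per SMX (as listed in the infobox). The dictionary layout and function names are illustrative.

```python
CORES_PER_SMX = 192   # CUDA cores per SMX, from the SMX section
L1_KB_PER_SMX = 16    # L1 cache per SMX, from the infobox

# Per-die figures copied from the table: SMX count, transistors (millions), die size (mm^2).
DIES = {
    "GK104": {"smx": 8, "mtr": 3540, "mm2": 294},
    "GK106": {"smx": 5, "mtr": 2540, "mm2": 221},
    "GK107": {"smx": 2, "mtr": 1270, "mm2": 118},
    "GK110": {"smx": 15, "mtr": 7080, "mm2": 561},
}


def cuda_cores(die):
    """Full-die CUDA core count implied by the SMX count."""
    return DIES[die]["smx"] * CORES_PER_SMX


def l1_cache_kb(die):
    """Total L1 cache across all SMX units, in KB."""
    return DIES[die]["smx"] * L1_KB_PER_SMX


def density_mtr_per_mm2(die):
    """Transistor density in millions of transistors per mm^2, rounded to one decimal."""
    return round(DIES[die]["mtr"] / DIES[die]["mm2"], 1)


if __name__ == "__main__":
    for die in DIES:
        print(die, cuda_cores(die), l1_cache_kb(die), density_mtr_per_mm2(die))
```

Under these assumptions the computed core counts, L1 totals, and densities agree with the corresponding table rows for all four dies.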
Kepler 2.0
- GK208
- GK210
- GK20A (Tegra K1)
See also
- List of eponyms of Nvidia GPU microarchitectures
- List of Nvidia graphics processing units
- Nvidia NVDEC
References
1. Mujtaba, Hassan (18 February 2012). "Nvidia Expected to launch Eight New 28nm Kepler GPU's in April 2012".
2. "Inside Kepler" (PDF). Retrieved 2015-09-19.
3. "Introducing The GeForce GTX 680 GPU". Nvidia. March 22, 2012. Retrieved 2015-09-19.
4. "Nvidia's Next Generation CUDA Compute Architecture: Kepler TM GK110" (PDF). Nvidia.
5. Smith, Ryan (March 22, 2012). "Nvidia GeForce GTX 680 Review: Retaking The Performance Crown". AnandTech. Retrieved November 25, 2012.
6. "Efficiency Through Hyper-Q, Dynamic Parallelism, & More". Nvidia. November 12, 2012. Retrieved 2015-09-19.
7. "GeForce GTX 770 | Specifications | GeForce". Nvidia. Retrieved 2022-06-07.
8. "NVIDIA GPU Decoder Device Information".
9. "GeForce 680 (Kepler) Whitepaper" (PDF). Nvidia. Retrieved March 22, 2024.
10. "Nvidia Kepler GK210/110 Architecture White Paper" (PDF). Nvidia. Retrieved 22 March 2024.
11. Smith, Ryan (November 12, 2012). "Nvidia Launches Tesla K20 & K20X: GK110 Arrives At Last". AnandTech. Retrieved September 19, 2015.
12. "Nvidia Kepler GK110 Architecture Whitepaper" (PDF). Nvidia. Retrieved 2015-09-19.
13. "Nvidia Launches First GeForce GPUs Based on Next-Generation Kepler Architecture". Nvidia. March 22, 2012. Archived from the original on June 14, 2013.
14. Edward, James (November 22, 2012). "Nvidia claims partially support DirectX 11.1". TechNews. Archived from the original on June 28, 2015. Retrieved 2015-09-19.
15. "Nvidia Doesn't Fully Support DirectX 11.1 with Kepler GPUs, But… (Web Archive Link)". BSN. Archived from the original on December 29, 2012.
16. "D3D_FEATURE_LEVEL enumeration (Windows)". MSDN. Retrieved 2015-09-19.
17. Moreton, Henry (March 20, 2014). "DirectX 12: A Major Stride for Gaming". Nvidia. Retrieved 2015-09-19.
18. "Nvidia GPUDirect". Nvidia Developer. October 6, 2015. Retrieved February 5, 2019.
19. Angelini, Chris (March 22, 2012). "Benchmark Results: NVEnc And MediaEspresso 6.5". Tom's Hardware. Retrieved September 19, 2015.
20. Angelini, Chris (November 7, 2013). "Nvidia GeForce GTX 780 Ti Review: GK110, Fully Unlocked". Tom's Hardware. p. 1. Retrieved December 6, 2015. Quote: "The card's driver deliberately operates GK110's FP64 units at 1/8 of the GPU's clock rate. When you multiply that by the 3:1 ratio of single- to double-precision CUDA cores, you get a 1/24 rate".
21. Smith, Ryan (13 September 2012). "The Nvidia GeForce GTX 660 Review: GK106 Fills Out The Kepler Family". AnandTech. p. 1. Retrieved 6 December 2015.