Parallel Thread Execution
{{short description|Low-level parallel thread execution virtual machine and instruction set architecture}}
'''Parallel Thread Execution''' ('''PTX''' or '''NVPTX'''<ref>{{cite web|url=https://llvm.org/docs/NVPTXUsage.html|title=User Guide for NVPTX Back-end – LLVM 7 documentation|website=llvm.org}}</ref>) is a low-level [[Parallel computing|parallel]] [[Thread (computing)|thread]] [[Execution (computing)|execution]] [[virtual machine]] and [[instruction set architecture]] used in [[Nvidia]]'s Compute Unified Device Architecture ([[CUDA]]) programming environment. The [[Nvidia CUDA Compiler]] (NVCC) translates code written in CUDA, a [[C++]]-like language, into PTX instructions (an [[assembly language]] represented as American Standard Code for Information Interchange ([[ASCII]]) text), and the graphics driver contains a [[compiler]] which translates PTX instructions into executable binary code,<ref>{{Cite web|url=https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-ref|title=CUDA Binary Utilities|website=docs.nvidia.com|language=en-us|access-date=2019-10-19}}</ref> which can run on the processing cores of [[List of Nvidia graphics processing units|Nvidia]] [[graphics processing unit]]s (GPUs). The [[GNU Compiler Collection]] also has basic ability to generate PTX in the context of [[OpenMP]] offloading.<ref>{{cite web |title=nvptx |url=https://gcc.gnu.org/wiki/nvptx |website=GCC Wiki}}</ref> Inline PTX assembly can be used in CUDA.<ref>{{Cite web|url=http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html|title=Inline PTX Assembly in CUDA|website=docs.nvidia.com|language=en-us|access-date=2019-11-03}}</ref>
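Inline PTX is embedded in CUDA C++ source through the <code>asm</code> statement. As a minimal sketch (the wrapper function name is hypothetical), the following device function reads the <code>%laneid</code> special register; <code>%%</code> escapes the <code>%</code> character inside an <code>asm</code> template, and the <code>"=r"</code> constraint binds the output to a 32-bit register:

<syntaxhighlight lang="cuda">
__device__ unsigned int lane_id() {
    unsigned int id;
    // mov.u32 copies the %laneid special register into the bound output register
    asm volatile ("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}
</syntaxhighlight>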

== Registers ==
PTX uses an arbitrarily large [[processor register]] set; the output from the compiler is almost pure [[static single-assignment form]], with consecutive lines generally referring to consecutive registers. Programs start with declarations of the form
<syntaxhighlight lang="ptx">
.reg .u32 %r<335>; // declare 335 registers %r0, %r1, ..., %r334 of type unsigned 32-bit integer
</syntaxhighlight>
It is a three-argument assembly language, and almost all instructions explicitly list the data type (in sign and width) on which they operate. Register names are prefixed with a <code>%</code> character and constants are literal, e.g.:
<syntaxhighlight lang="ptx">
shr.u64 %rd14, %rd12, 32;  // shift right an unsigned 64-bit integer from %rd12 by 32 positions, result in %rd14
cvt.u64.u32 %rd142, %r112; // convert an unsigned 32-bit integer to 64-bit
</syntaxhighlight>
There are predicate registers, but compiled code in shader model 1.0 uses these only in conjunction with branch commands; the conditional branch is
<syntaxhighlight lang="ptx">
@%p14 bra $label; // branch to $label
</syntaxhighlight>
The <code>setp.cc.type</code> instruction sets a predicate register to the result of comparing two registers of appropriate type; there is also a <code>set</code> instruction, where <syntaxhighlight lang="ptx" inline>set.le.u32.u64 %r101, %rd12, %rd28</syntaxhighlight> sets the 32-bit register <code>%r101</code> to <code>0xffffffff</code> if the 64-bit register <code>%rd12</code> is less than or equal to the 64-bit register <code>%rd28</code>, and to <code>0x00000000</code> otherwise.
There are a few predefined identifiers that denote pseudoregisters. Among others, <code>%tid</code>, <code>%ntid</code>, <code>%ctaid</code>, and <code>%nctaid</code> contain, respectively, thread indices, block dimensions, block indices, and grid dimensions.<ref name="ptx-isa">{{cite web|url=http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/ptx_isa_2.3.pdf|title=PTX ISA Version 2.3|publisher=Nvidia}}</ref>
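A kernel typically combines these pseudoregisters to compute its global thread index. A representative sketch of such compiler output (the register numbers are illustrative, not taken from any particular compilation):

<syntaxhighlight lang="ptx">
mov.u32    %r1, %ctaid.x;        // block index
mov.u32    %r2, %ntid.x;         // block dimension (threads per block)
mov.u32    %r3, %tid.x;          // thread index within the block
mad.lo.s32 %r4, %r1, %r2, %r3;   // global index = %ctaid.x * %ntid.x + %tid.x (low 32 bits)
</syntaxhighlight>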

== State spaces ==
Load (<code>ld</code>) and store (<code>st</code>) commands refer to one of several distinct state spaces (memory banks), e.g. <code>ld.param</code>.

There are eight state spaces:<ref name="ptx-isa"/>
; <code>.reg</code> : registers
; <code>.sreg</code> : special, read-only, platform-specific registers
; <code>.const</code> : shared, read-only memory
; <code>.global</code> : global memory, shared by all threads
; <code>.local</code> : local memory, private to each thread
; <code>.param</code> : parameters passed to the kernel
; <code>.shared</code> : memory shared between threads in a block
; <code>.tex</code> : global texture memory (deprecated)
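Every memory access names its state space explicitly. A short illustrative sequence (the parameter symbol is a hypothetical name; real compiler output uses mangled names): it reads a pointer argument from the <code>.param</code> space, converts it to a generic global address, and then loads and stores through it:

<syntaxhighlight lang="ptx">
ld.param.u64       %rd1, [kernel_param_0];  // read a pointer argument from the .param space
cvta.to.global.u64 %rd2, %rd1;              // convert it to a .global state-space address
ld.global.f32      %f1, [%rd2];             // load a 32-bit float from global memory
st.global.f32      [%rd2+4], %f1;           // store it to the next element
</syntaxhighlight>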
Shared memory is declared at the start of the PTX file via lines of the form:
<syntaxhighlight lang="ptx">
.shared .align 8 .b8 pbatch_cache[15744]; // define 15,744 bytes, aligned to an 8-byte boundary
</syntaxhighlight>
Writing kernels in PTX requires explicitly registering PTX modules via the CUDA Driver API, which is typically more cumbersome than using the CUDA Runtime API and Nvidia's CUDA compiler, nvcc. The GPU Ocelot project provided an API to register PTX modules alongside CUDA Runtime API kernel invocations, though GPU Ocelot is no longer actively maintained.<ref>{{cite web|url=https://github.com/gtcasl/gpuocelot|title=GPUOCelot: A dynamic compilation framework for PTX|date=7 November 2022|website=github.com}}</ref>
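A minimal sketch of this Driver API registration in CUDA host code (error checking omitted; the function and kernel entry-point names are hypothetical):

<syntaxhighlight lang="c">
#include <cuda.h>
#include <stddef.h>

/* Load a PTX module from a source string and launch one of its kernels. */
void launch_from_ptx(const char *ptx_source, void *args[]) {
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   module;
    CUfunction kernel;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoadData(&module, ptx_source);            // register the PTX module
    cuModuleGetFunction(&kernel, module, "my_kernel"); // look up the kernel entry point
    cuLaunchKernel(kernel,
                   1, 1, 1,     // grid dimensions
                   256, 1, 1,   // block dimensions
                   0, NULL,     // shared memory bytes, stream
                   args, NULL); // kernel arguments, extra options
    cuCtxSynchronize();
    cuModuleUnload(module);
    cuCtxDestroy(ctx);
}
</syntaxhighlight>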

==See also==
* [[Standard Portable Intermediate Representation]] (SPIR)
* [[CUDA binary]] (cubin) – a type of fat binary
== References ==
{{Reflist}}
== External links ==
* [https://docs.nvidia.com/cuda/parallel-thread-execution/ PTX ISA page on NVIDIA Developer Zone]
* [http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_2.3.pdf PTX ISA Version 2.3] NVIDIA, 2011-11-03
* [http://docs.nvidia.com/cuda/pdf/ptx_isa_3.2.pdf PTX ISA Version 3.2] NVIDIA, 2013-07-19
* [http://docs.nvidia.com/cuda/pdf/ptx_isa_4.0.pdf PTX ISA Version 4.0] NVIDIA, 2014-04-12
* [http://docs.nvidia.com/cuda/pdf/ptx_isa_4.3.pdf PTX ISA Version 4.3] NVIDIA, 2015-08-15
* [http://docs.nvidia.com/cuda/pdf/ptx_isa_5.0.pdf PTX ISA Version 5.0] NVIDIA, 2017-06-xx
* [http://docs.nvidia.com/cuda/pdf/ptx_isa_6.0.pdf PTX ISA Version 6.0] NVIDIA, 2017-09-xx
* [http://code.google.com/p/gpuocelot/ GPU Ocelot], April 2011
[[Category:Nvidia]]
[[Category:Instruction set architectures]]