Parallel Thread Execution: Difference between revisions
CUDA is C++-like, not C-like.
use closest syntax highlighting that gives acceptable output which assists reading better than text value e.g. comments and instructions, inline EXT->ref, def acro
Revision as of 20:59, 1 March 2018
Parallel Thread Execution (PTX, or NVPTX[1]) is a pseudo-assembly language used in Nvidia's CUDA programming environment. The nvcc compiler translates code written in CUDA, a C++-like language, into PTX, and the graphics driver contains a compiler that translates the PTX into binary code that runs on the processing cores.
Registers
PTX uses an arbitrarily large register set; the output from the compiler is almost pure single-assignment form, with consecutive lines generally referring to consecutive registers. Programs start with declarations of the form:
.reg .u32 %r<335>; // declare 335 registers %r0, %r1, ..., %r334 of type unsigned 32-bit integer
PTX is a three-argument assembly language, and almost all instructions explicitly list the data type (in terms of sign and width) on which they operate. Register names are prefixed with a % character and constants are written as literals, e.g.:
shr.u64 %rd14, %rd12, 32; // shift right an unsigned 64-bit integer from %rd12 by 32 positions, result in %rd14
cvt.u64.u32 %rd142, %r112; // convert an unsigned 32-bit integer to 64-bit
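The effect of these two instructions can be modelled on the host. The following is a minimal C++ sketch, not PTX and not anything nvcc emits; the function names shr_u64 and cvt_u64_u32 are hypothetical, chosen only to mirror the instruction names, and the register contents are made up. It assumes, following the PTX ISA's description of shr, that shift amounts larger than the register width are clamped, so an unsigned shift by 64 or more yields zero.

```cpp
#include <cstdint>

// Host-side model of `shr.u64 d, a, b`: logical (zero-filling) right shift.
// Assumption: shift amounts of 64 or more produce 0, modelling PTX's clamping;
// a plain C++ `a >> b` would be undefined behaviour for b >= 64.
constexpr std::uint64_t shr_u64(std::uint64_t a, std::uint32_t b) {
    return b >= 64 ? 0 : (a >> b);
}

// Host-side model of `cvt.u64.u32 d, a`: zero-extend a 32-bit value to 64 bits.
constexpr std::uint64_t cvt_u64_u32(std::uint32_t a) {
    return static_cast<std::uint64_t>(a);
}

// Shifting right by 32 moves the upper half of the value into the lower half.
static_assert(shr_u64(0x123456789ABCDEF0ULL, 32) == 0x12345678ULL,
              "upper 32 bits end up in the low half");
// Widening an unsigned value leaves the new upper 32 bits zero.
static_assert(cvt_u64_u32(0xFFFFFFFFu) == 0x00000000FFFFFFFFULL,
              "zero extension");
```

Because both instructions carry an unsigned type suffix, neither propagates a sign bit; the signed forms (shr.s64, cvt.s64.s32) would behave differently on negative inputs.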
There are predicate registers, but compiled code in shader model 1.0 uses these only in conjunction with branch commands; the conditional branch is:
@%p14 bra $label; // branch to $label if predicate register %p14 is true
The setp.cc.type instruction sets a predicate register to the result of comparing two registers of appropriate type. There is also a set instruction, which writes the result to an ordinary register instead: set.le.u32.u64 %r101, %rd12, %rd28 sets the 32-bit register %r101 to 0xffffffff if the 64-bit register %rd12 is less than or equal to the 64-bit register %rd28, and to 0x00000000 otherwise.
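The difference between setp, set, and a predicated branch can be illustrated with a small host-side model. This is a C++ sketch under stated assumptions rather than PTX: the function names are hypothetical, and program counters are represented as plain integers purely for illustration.

```cpp
#include <cstdint>

// Model of `setp.le.u64 p, a, b`: the result is a one-bit predicate.
constexpr bool setp_le_u64(std::uint64_t a, std::uint64_t b) {
    return a <= b;
}

// Model of `set.le.u32.u64 d, a, b`: the result is a 32-bit mask,
// all ones when the comparison holds and all zeros otherwise.
constexpr std::uint32_t set_le_u32_u64(std::uint64_t a, std::uint64_t b) {
    return a <= b ? 0xFFFFFFFFu : 0x00000000u;
}

// Model of `@p bra target;`: the branch is taken only when p is true.
// `target` and `fallthrough` are hypothetical instruction indices.
constexpr int next_pc(bool p, int target, int fallthrough) {
    return p ? target : fallthrough;
}

static_assert(setp_le_u64(3, 5), "3 <= 5 sets the predicate");
static_assert(set_le_u32_u64(3, 5) == 0xFFFFFFFFu, "true becomes all ones");
static_assert(set_le_u32_u64(9, 5) == 0x00000000u, "false becomes zero");
static_assert(next_pc(setp_le_u64(9, 5), 40, 11) == 11, "false predicate falls through");
```

The mask form produced by set can be combined with bitwise instructions, whereas the predicate produced by setp is what a guard such as @%p14 consumes.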
There are a few predefined identifiers that denote pseudoregisters. Among others, %tid, %ntid, %ctaid, and %nctaid contain, respectively, thread indices, block dimensions, block indices, and grid dimensions.[2]
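A common use of these pseudoregisters is computing a thread's position in the whole grid. The sketch below models that arithmetic in host-side C++; the Dim3 struct and the function names are hypothetical stand-ins for the x, y and z components of the special registers, and the values used are made up.

```cpp
#include <cstdint>

// Hypothetical stand-in for the .x/.y/.z components of a PTX special register.
struct Dim3 {
    std::uint32_t x, y, z;
};

// Global index along x: block index times block width, plus the index inside the block.
// In PTX this corresponds to combining %ctaid.x, %ntid.x and %tid.x.
constexpr std::uint32_t global_x(Dim3 tid, Dim3 ntid, Dim3 ctaid) {
    return ctaid.x * ntid.x + tid.x;
}

// Total number of threads along x: grid width (%nctaid.x) times block width (%ntid.x).
constexpr std::uint32_t total_x(Dim3 ntid, Dim3 nctaid) {
    return nctaid.x * ntid.x;
}

// Thread 5 of block 3, with 256 threads per block, is thread 3 * 256 + 5 = 773 overall.
static_assert(global_x(Dim3{5, 0, 0}, Dim3{256, 1, 1}, Dim3{3, 0, 0}) == 773,
              "global thread index");
```

In CUDA source the same quantities appear as threadIdx, blockDim, blockIdx and gridDim.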
State spaces
Load (ld) and store (st) commands refer to one of several distinct state spaces (memory banks), e.g. ld.param. There are eight state spaces:[2]
- .reg : registers
- .sreg : special, read-only, platform-specific registers
- .const : shared, read-only memory
- .global : global memory, shared by all threads
- .local : local memory, private to each thread
- .param : parameters passed to the kernel
- .shared : memory shared between threads in a block
- .tex : global texture memory (deprecated)
Shared memory is declared in the PTX file via lines near the start of the file, of the form:
.shared .align 8 .b8 pbatch_cache[15744]; // define 15744 bytes, aligned to an 8-byte boundary
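For readers more familiar with C++, the size and alignment parts of this declaration have a rough host-side analogue. The sketch below is only an illustration: the PbatchCache type is hypothetical, ordinary host memory stands in for the .shared state space, and the remark about 64-bit access is an assumption about why an 8-byte alignment might be requested, not something the PTX line itself states.

```cpp
#include <cstddef>
#include <cstdint>

// Host-side analogue of `.shared .align 8 .b8 pbatch_cache[15744];`
// .b8 corresponds to untyped bytes, .align 8 to the alignas(8) requirement.
struct alignas(8) PbatchCache {
    std::uint8_t bytes[15744];
};

static_assert(sizeof(PbatchCache) == 15744, "15744 untyped bytes");
static_assert(alignof(PbatchCache) == 8, "starts on an 8-byte boundary");
// Assumption: 8-byte alignment allows the buffer to be viewed as 1968 64-bit values.
static_assert(sizeof(PbatchCache) / sizeof(std::uint64_t) == 1968, "15744 / 8");
```

In CUDA source, the corresponding declaration would carry the __shared__ qualifier, which has no host-side equivalent.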
Writing kernels in PTX requires explicitly registering PTX modules via the CUDA Driver API, which is typically more cumbersome than using the CUDA Runtime API and NVIDIA's CUDA compiler, nvcc. The GPU Ocelot project provided an API to register PTX modules alongside CUDA Runtime API kernel invocations, though GPU Ocelot is no longer actively maintained.[3]
See also
References
- [1] https://llvm.org/docs/NVPTXUsage.html
- [3] http://code.google.com/p/gpuocelot/
External links
- PTX ISA Version 1.4 NVIDIA, 2009-03-31
- PTX ISA Version 2.3 NVIDIA, 2011-11-03
- PTX ISA Version 3.2 NVIDIA, 2013-07-19
- PTX ISA Version 4.0 NVIDIA, 2014-04-12
- PTX ISA Version 4.3 NVIDIA, 2015-08-15
- PTX ISA Version 5.0 NVIDIA, 2017-06-xx
- PTX ISA Version 6.0 NVIDIA, 2017-09-xx
- PTX ISA page on NVIDIA Developer Zone
- GPU Ocelot, April 2011