AVX-512

AVX-512 is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions (AVX) SIMD instructions for the x86 instruction set architecture. It was proposed by Intel in July 2013 and is scheduled to be supported in 2015 with Intel's Knights Landing processor.^[1]

AVX-512 consists of multiple extensions, and a processor implementing AVX-512 is not required to support all of them. Only the core extension, AVX-512F (AVX-512 Foundation), is required by all implementations.

The instruction set consists of the following:

AVX-512 Foundation – expands most 32-bit and 64-bit based AVX instructions with EVEX coding scheme to support 512-bit registers, operation masks, parameter broadcasting, and embedded rounding and exception control
AVX-512 Conflict Detection Instructions (CDI) – efficient conflict detection to allow more loops to be vectorized, supported by Knights Landing^[1]
AVX-512 Exponential and Reciprocal Instructions (ERI) – exponential and reciprocal operations designed to help implement transcendental operations, supported by Knights Landing^[1]
AVX-512 Prefetch Instructions (PFI) – new prefetch capabilities, supported by Knights Landing^[1]

Encoding and features

The VEX prefix used by AVX and AVX2, while flexible, did not leave enough room for the features Intel wanted to add to AVX-512. This led Intel to define a new prefix called EVEX.

Compared to VEX, EVEX adds the following benefits:^[2]

Expanded register encoding allowing 32 512-bit registers.
Adds 8 new opmask registers (k0–k7) for masking most AVX-512 instructions; k0 cannot be selected as a write mask, which leaves 7 usable masks.
Adds a new scalar memory mode that automatically performs a broadcast.
Adds room for explicit rounding control in each instruction.
Adds a new compressed displacement memory addressing mode.
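The compressed displacement mode can be illustrated with a small sketch. Intel's documentation describes it as disp8*N: the one-byte displacement stored in the instruction is multiplied by a factor N before it is added to the address. The function name `encode_disp8xn` is hypothetical, and the sketch assumes N is simply the size of the memory operand in bytes; the real factor also depends on the instruction and on whether broadcast is used.

```python
def encode_disp8xn(displacement: int, n: int):
    """Return the signed 8-bit value that would be stored for a compressed
    displacement, or None when a full 32-bit displacement is needed instead.

    n is the scaling factor, assumed here to be the memory operand size in bytes.
    """
    if displacement % n != 0:
        return None              # not a multiple of N, cannot be compressed
    scaled = displacement // n
    if -128 <= scaled <= 127:
        return scaled            # fits in one signed byte
    return None


# A 64-byte (512-bit) memory operand, so N = 64 under the assumption above.
print(encode_disp8xn(64, 64))     # 1
print(encode_disp8xn(8128, 64))   # 127
print(encode_disp8xn(100, 64))    # None
```

With N = 64, a single displacement byte covers offsets from -8192 to 8128 in steps of 64, which is the common case when stepping through an array of full vectors.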

SIMD modes

Like the VEX prefix, the EVEX prefix allows several register width modes: 128-bit, 256-bit and 512-bit. This means that most SSE and AVX instructions have new AVX-512 versions that can use the new features above, such as opmasks and the additional addressable registers. Unlike AVX-256, the new instructions do not have new names but share a namespace with AVX, which makes the distinction between the VEX-encoded and EVEX-encoded versions of an instruction ambiguous.

Unlike SSE and VEX-based AVX, AVX-512 only supports 32- and 64-bit values.^[2]

Name	Standards	Registers	Types
Legacy SSE	SSE-SSE4.2	xmm0-xmm15	bytes, words, doublewords, quadwords, single float and double float (from SSE2)
AVX-128 (VEX)	AVX, AVX2	xmm0-xmm15	single float and double float. From AVX2: bytes, words, doublewords, quadwords
AVX-256 (VEX)	AVX, AVX2	ymm0-ymm15	single float and double float. From AVX2: bytes, words, doublewords, quadwords
AVX-128 (EVEX)	AVX-512F	xmm0-xmm31 (k1-k7)	doublewords, quadwords, single float and double float
AVX-256 (EVEX)	AVX-512F	ymm0-ymm31 (k1-k7)	doublewords, quadwords, single float and double float
AVX-512 (EVEX)	AVX-512F	zmm0-zmm31 (k1-k7)	doublewords, quadwords, single float and double float

Extended registers

AVX-512 (ZMM) register scheme as extension from the AVX (YMM) and SSE (XMM) registers
511 256	255 128	127 0

ZMM0	YMM0	XMM0
ZMM1	YMM1	XMM1
ZMM2	YMM2	XMM2
ZMM3	YMM3	XMM3
ZMM4	YMM4	XMM4
ZMM5	YMM5	XMM5
ZMM6	YMM6	XMM6
ZMM7	YMM7	XMM7
ZMM8	YMM8	XMM8
ZMM9	YMM9	XMM9
ZMM10	YMM10	XMM10
ZMM11	YMM11	XMM11
ZMM12	YMM12	XMM12
ZMM13	YMM13	XMM13
ZMM14	YMM14	XMM14
ZMM15	YMM15	XMM15
ZMM16	YMM16	XMM16
ZMM17	YMM17	XMM17
ZMM18	YMM18	XMM18
ZMM19	YMM19	XMM19
ZMM20	YMM20	XMM20
ZMM21	YMM21	XMM21
ZMM22	YMM22	XMM22
ZMM23	YMM23	XMM23
ZMM24	YMM24	XMM24
ZMM25	YMM25	XMM25
ZMM26	YMM26	XMM26
ZMM27	YMM27	XMM27
ZMM28	YMM28	XMM28
ZMM29	YMM29	XMM29
ZMM30	YMM30	XMM30
ZMM31	YMM31	XMM31

The extended registers, SIMD width, and opmask registers of AVX-512 all require OS support. Each of the three has its own feature bit. While this could in theory be used to indicate support for only some of the features, all three are required for AVX-512. Only the opmask registers may be used alone to extend traditional AVX-256 without full AVX-512 support.

Although all three features are required for AVX-512, the extended registers do not work only in 512-bit mode. The new registers 16–31 are also available in the AVX-128 and AVX-256 modes of the EVEX prefix.
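As a rough illustration of the OS-support check, the sketch below tests the state bits of the XCR0 register as laid out in Intel's documentation: bits 1 and 2 for SSE and AVX state, and bits 5, 6 and 7 for the opmask registers, the upper halves of ZMM0–ZMM15, and ZMM16–ZMM31. The function name `os_supports_avx512` is hypothetical, and the XCR0 value is passed in as a plain integer because reading it needs the XGETBV instruction, which Python cannot issue directly.

```python
XCR0_SSE = 1 << 1         # XMM register state
XCR0_AVX = 1 << 2         # upper halves of YMM0-YMM15
XCR0_OPMASK = 1 << 5      # opmask registers k0-k7
XCR0_ZMM_HI256 = 1 << 6   # upper halves of ZMM0-ZMM15
XCR0_HI16_ZMM = 1 << 7    # registers ZMM16-ZMM31

AVX512_STATE = (XCR0_SSE | XCR0_AVX |
                XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM)


def os_supports_avx512(xcr0: int) -> bool:
    """True only if the OS has enabled every state component AVX-512 needs."""
    return (xcr0 & AVX512_STATE) == AVX512_STATE


print(os_supports_avx512(0xE7))  # True: x87, SSE, AVX and all three AVX-512 bits
print(os_supports_avx512(0x07))  # False: only x87, SSE and AVX state enabled
```

A complete check would also consult the CPUID feature flags for the individual extensions; this sketch covers only the OS-enabled state.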

Opmask registers

Most AVX-512 instructions may specify one of 8 opmask registers (k0–k7). When selected as the mask of an instruction, however, the first one, k0, acts as a hardcoded constant that indicates an unmasked operation. In most instructions the opmask controls which values are written to the destination. A flag controls the opmask behavior, which can be either "zero", which zeroes every element not selected by the mask, or "merge", which leaves every unselected element untouched. The merge behavior is identical to that of the blend instructions.
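The two masking behaviors can be sketched with a scalar emulation. The function `masked_add` is hypothetical and uses Python lists in place of vector registers, with bit i of the mask controlling element i.

```python
def masked_add(dst, a, b, mask, zeroing=False):
    """Emulate an element-wise add under an opmask.

    Elements whose mask bit is set receive a[i] + b[i]. The remaining
    elements are zeroed ("zero" behavior) or keep the old destination
    value ("merge" behavior).
    """
    out = []
    for i in range(len(a)):
        if (mask >> i) & 1:
            out.append(a[i] + b[i])
        elif zeroing:
            out.append(0)
        else:
            out.append(dst[i])
    return out


old = [9, 9, 9, 9]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(masked_add(old, a, b, 0b0101))                # [11, 9, 33, 9]
print(masked_add(old, a, b, 0b0101, zeroing=True))  # [11, 0, 33, 0]
```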

The opmask registers are currently 16 bits wide, but can potentially be expanded up to 64 bits.^[2] How many of the bits are actually used depends on the vector type of the masked instruction. For 32-bit single floats or doublewords, 16 bits are used to mask the 16 elements in a 512-bit register. For double floats and quadwords, at most 8 mask bits are used.
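The number of mask bits in use follows directly from the vector width and the element width. The helper name `mask_bits_used` is hypothetical; the sketch simply tabulates the division for the three EVEX vector widths.

```python
def mask_bits_used(vector_bits: int, element_bits: int) -> int:
    """One mask bit per element: vector width divided by element width."""
    return vector_bits // element_bits


for vector_bits in (128, 256, 512):
    for element_bits in (32, 64):
        print(vector_bits, element_bits, mask_bits_used(vector_bits, element_bits))
```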

The opmask register is the reason why several bitwise instructions, which naturally have no element width, had one added in AVX-512. For instance, bitwise AND, OR and the 128-bit shuffle now exist in both doubleword and quadword variants, the only difference being the final masking.

New opmask instructions

The opmask registers have a new mini extension of instructions operating directly on them. Unlike the rest of the AVX-512 instructions, these instructions are all VEX encoded.

Instruction	Description
`KANDW`	Bitwise logical AND Masks
`KANDNW`	Bitwise logical AND NOT Masks
`KMOVW`	Move from and to Mask Registers
`KUNPCKBW`	Unpack for Mask Registers
`KNOTW`	NOT Mask Register
`KORW`	Bitwise logical OR Masks
`KORTESTW`	OR Masks And Set Flags
`KSHIFTLW`	Shift Left Mask Registers
`KSHIFTRW`	Shift Right Mask Registers
`KXNORW`	Bitwise logical XNOR Masks
`KXORW`	Bitwise logical XOR Masks

New instructions in AVX-512 foundation

Blend using mask

There are no EVEX-prefixed versions of the blend instructions from SSE4; instead, AVX-512 has a new set of four blending instructions using mask registers as selectors. Together with the general compare into mask instructions below, these may be used to implement generic ternary operations or cmov, similar to XOP's VPCMOV.

Since blending is an integral part of the EVEX encoding, these instructions may also be considered basic move instructions. Using the zeroing blend mode, they can also be used as masking instructions.

Instruction	Description
`VBLENDMPD`	Blend float64 vectors using opmask control
`VBLENDMPS`	Blend float32 vectors using opmask control
`VPBLENDMD`	Blend int32 vectors using opmask control
`VPBLENDMQ`	Blend int64 vectors using opmask control
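A scalar sketch of a mask-controlled blend follows. The function `blend_mask` is hypothetical; it assumes, as in Intel's description of these instructions, that a set mask bit selects the element from the second source, and that an unset bit selects the first source or, in the zeroing mode, yields zero.

```python
def blend_mask(a, b, mask, zeroing=False):
    """Per element: take b[i] where the mask bit is set, otherwise a[i]
    (or 0 when the zeroing mode is requested)."""
    out = []
    for i in range(len(a)):
        if (mask >> i) & 1:
            out.append(b[i])
        else:
            out.append(0 if zeroing else a[i])
    return out


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(blend_mask(a, b, 0b0011))                # [10, 20, 3, 4]
print(blend_mask(a, b, 0b0011, zeroing=True))  # [10, 20, 0, 0]
```

The zeroing call shows why the blend can double as a masking instruction: it passes through the selected elements of one source and clears the rest.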

Compare into mask

AVX-512F has four new compare instructions. Like their XOP counterparts, they use the immediate field to select between 8 different comparisons. Unlike their XOP inspiration, however, they save the result to a mask register and only support doubleword and quadword comparisons. Note that two mask registers may be specified for these instructions: one to write to and one for regular masking.^[2]

Immediate	Comparison	Description
0	EQ	Equal
1	LT	Less than
2	LE	Less than or equal
3	FALSE	Set to zero
4	NEQ	Not equal
5	NLT	Greater than or equal
6	NLE	Greater than
7	TRUE	Set to one

Instruction	Description
`VPCMPD` `VPCMPUD`	Compare signed/unsigned doublewords into mask
`VPCMPQ` `VPCMPUQ`	Compare signed/unsigned quadwords into mask
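The compare-into-mask behavior can be sketched as follows, using the immediate encodings from the table above. The names `PREDICATES` and `compare_into_mask` are hypothetical. The sketch works on Python integers, so the signed and unsigned variants differ only in how the caller interprets the element values, and it assumes that elements disabled by the regular mask produce a zero bit in the destination.

```python
import operator

# Immediate value -> comparison, following the table above.
PREDICATES = {
    0: operator.eq,            # EQ
    1: operator.lt,            # LT
    2: operator.le,            # LE
    3: lambda x, y: False,     # FALSE
    4: operator.ne,            # NEQ
    5: operator.ge,            # NLT
    6: operator.gt,            # NLE
    7: lambda x, y: True,      # TRUE
}


def compare_into_mask(a, b, imm, writemask=None):
    """Return an integer whose bit i is set when element i is enabled by
    the writemask and the selected comparison of a[i] with b[i] holds."""
    predicate = PREDICATES[imm & 7]
    result = 0
    for i in range(len(a)):
        enabled = writemask is None or (writemask >> i) & 1
        if enabled and predicate(a[i], b[i]):
            result |= 1 << i
    return result


a = [1, 5, 3, 7]
b = [4, 4, 4, 4]
print(bin(compare_into_mask(a, b, 1)))          # 0b101  (elements 0 and 2 are less than 4)
print(bin(compare_into_mask(a, b, 6)))          # 0b1010 (elements 1 and 3 are greater than 4)
print(bin(compare_into_mask(a, b, 6, 0b0010)))  # 0b10   (element 3 is masked off)
```

Feeding the resulting mask into a mask-controlled blend gives the element-wise conditional move mentioned earlier in this section.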

Logical set mask

The final way to set masks is Logical Set Mask. These instructions perform either AND or NAND and then set the destination opmask according to whether the result values are zero or non-zero. Like the comparison instructions, they take two opmask registers: one as the destination and one as a regular opmask.

Instruction	Description
`VPTESTMD`, `VPTESTMQ`	Logical AND and Set Mask
`VPTESTNMD`, `VPTESTNMQ`	Logical NAND and Set Mask

Compress and expand

The compress and expand instructions use the opmask in a slightly different way. Compress only saves the values marked in the mask, but saves them compacted by skipping and not reserving space for unmarked values. Expand operates in the opposite way, by loading as many values as indicated in the mask and then spreading them to the selected positions.

Instruction	Description
`VCOMPRESSPD`, `VCOMPRESSPS`	Store sparse packed double/single-precision floating-point values into dense memory
`VPCOMPRESSD`, `VPCOMPRESSQ`	Store sparse packed doubleword/quadword integer values into dense memory/register
`VEXPANDPD`, `VEXPANDPS`	Load sparse packed double/single-precision floating-point values from dense memory
`VPEXPANDD`, `VPEXPANDQ`	Load sparse packed doubleword/quadword integer values from dense memory/register
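Compress and expand can be sketched with lists standing in for registers and memory. The function names are hypothetical, and `expand` takes the old destination contents explicitly so that unselected positions keep their values (the merge behavior).

```python
def compress(src, mask):
    """Keep only the elements whose mask bit is set, packed densely
    in their original order."""
    return [src[i] for i in range(len(src)) if (mask >> i) & 1]


def expand(packed, mask, dst):
    """Spread consecutive values from `packed` into the positions of
    `dst` whose mask bit is set; other positions are left unchanged."""
    out = list(dst)
    values = iter(packed)
    for i in range(len(out)):
        if (mask >> i) & 1:
            out[i] = next(values)
    return out


src = [10, 11, 12, 13, 14, 15, 16, 17]
mask = 0b10100110                      # selects elements 1, 2, 5 and 7
dense = compress(src, mask)
print(dense)                           # [11, 12, 15, 17]
print(expand(dense, mask, [0] * 8))    # [0, 11, 12, 0, 0, 15, 0, 17]
```

Expanding a compressed vector with the same mask puts every selected element back in its original position, which is what makes the pair useful for filtering data and later restoring its layout.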

Permute

Eight new permute instructions have been specified that all take three arguments, two source registers and one index, and output the result by overwriting either the first source register or the index register.

Instruction	Description
`VPERMI2PD`, `VPERMI2Q`	Full 64-bit permute overwriting the index.
`VPERMI2PS`, `VPERMI2D`	Full 32-bit permute overwriting the index.
`VPERMT2PD`, `VPERMT2Q`	Full 64-bit permute overwriting first source.
`VPERMT2PS`, `VPERMT2D`	Full 32-bit permute overwriting first source.
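The two-source permute can be sketched as a lookup into the concatenation of both sources. The names `permute2`, `table_a` and `table_b` are hypothetical. The sketch assumes the element count is a power of two and that each index is truncated to just enough bits to address both tables, with the top remaining bit choosing between them. Which register is overwritten (index or first source) is the only difference between the two instruction families and is not modeled here.

```python
def permute2(table_a, table_b, indices):
    """For each index, pick an element from table_a (low index range)
    or table_b (high index range)."""
    n = len(table_a)                      # assumed to be a power of two
    combined = list(table_a) + list(table_b)
    return [combined[index & (2 * n - 1)] for index in indices]


a = ["a0", "a1", "a2", "a3"]
b = ["b0", "b1", "b2", "b3"]
print(permute2(a, b, [0, 4, 1, 5]))   # ['a0', 'b0', 'a1', 'b1']
print(permute2(a, b, [7, 6, 5, 4]))   # ['b3', 'b2', 'b1', 'b0']
```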

Bitwise ternary logic

Two new instructions can implement all possible bitwise operations between three inputs. They take three registers as input and an 8-bit immediate field. Each bit in the output is generated by using the three corresponding bits of the inputs as an index that selects one of the 8 bit positions in the immediate. Since three bits allow only 8 combinations, this allows every possible three-input bitwise operation to be performed.^[2]

The difference in the doubleword and quadword versions is only the application of the opmask.

Instruction	Description
`VPTERNLOGD`, `VPTERNLOGQ`	Bitwise Ternary Logic

Examples:

A0	A1	A2	Double AND (0x80)	Double OR (0xFE)	Bitwise blend (0xCA)
0	0	0	0	0	0
0	0	1	0	1	1
0	1	0	0	1	0
0	1	1	0	1	1
1	0	0	0	1	0
1	0	1	0	1	0
1	1	0	0	1	1
1	1	1	1	1	1

Floating point decomposition

Among the unique new features in AVX-512F are instructions to decompose floating-point values and handle special floating-point values. Since these methods are completely new, they also exist in scalar versions.

Instruction	Description
`VGETEXPPD`, `VGETEXPPS`	Convert exponents of packed fp values into fp values
`VGETEXPSD`, `VGETEXPSS`	Convert exponent of scalar fp value into fp value
`VGETMANTPD`, `VGETMANTPS`	Extract vector of normalized mantissas from float64/float32 vector
`VGETMANTSD`, `VGETMANTSS`	Extract float64/float32 of normalized mantissa from float64/float32 scalar
`VFIXUPIMMPD`, `VFIXUPIMMPS`	Fix up special packed float32/float64 values
`VFIXUPIMMSD`, `VFIXUPIMMSS`	Fix up special scalar float32/float64 value
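The exponent and mantissa extraction can be approximated in scalar form with `math.frexp`. The functions `getexp` and `getmant` are hypothetical and only a sketch: they assume a positive, finite, non-zero input and a mantissa normalized to the interval [1, 2), and they ignore the special-value handling and the selectable intervals and sign controls of the real instructions.

```python
import math


def getexp(x: float) -> float:
    """Unbiased exponent of x as a floating-point value, floor(log2(x))."""
    _, exponent = math.frexp(x)     # x == m * 2**exponent with 0.5 <= m < 1
    return float(exponent - 1)


def getmant(x: float) -> float:
    """Mantissa of x normalized to the interval [1, 2)."""
    mantissa, _ = math.frexp(x)
    return mantissa * 2.0


x = 10.0
print(getexp(x))                       # 3.0
print(getmant(x))                      # 1.25
print(getmant(x) * 2.0 ** getexp(x))   # 10.0
```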

Floating point arithmetic

This is the second set of new floating-point instructions, which includes new scaling instructions and approximate calculation of the reciprocal and of the reciprocal of the square root. The approximate reciprocal instructions are guaranteed to have a relative error of at most 2^-14.^[2]

Instruction	Description
`VRCP14PD`, `VRCP14PS`	Compute approximate reciprocals of packed float32/float64 values
`VRCP14SD`, `VRCP14SS`	Compute approximate reciprocals of scalar float32/float64 value
`VRNDSCALEPS`, `VRNDSCALEPD`	Round packed float32/float64 values to include a given number of fraction bits
`VRNDSCALESS`, `VRNDSCALESD`	Round scalar float32/float64 value to include a given number of fraction bits
`VRSQRT14PD`, `VRSQRT14PS`	Compute approximate reciprocals of square roots of packed float32/float64 values
`VRSQRT14SD`, `VRSQRT14SS`	Compute approximate reciprocal of square root of scalar float32/float64 value
`VSCALEFPS`, `VSCALEFPD`	Scale packed float32/float64 values with float32/float64 values
`VSCALEFSS`, `VSCALEFSD`	Scale scalar float32/float64 value with float32/float64 value
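The scaling and rounding operations in the table can be sketched for ordinary finite values. The functions `scalef` and `rndscale` are hypothetical. The sketch assumes that scaling multiplies the first operand by 2 raised to the floor of the second, and that rounding keeps a given number of binary fraction bits using round-to-nearest-even, which is one of several rounding modes the instruction can select. Special values and overflow are ignored.

```python
import math


def scalef(a: float, b: float) -> float:
    """Scale a by a power of two: a * 2**floor(b)."""
    return math.ldexp(a, math.floor(b))


def rndscale(x: float, fraction_bits: int) -> float:
    """Round x so that only `fraction_bits` binary fraction bits remain.
    Python's round() rounds halves to the nearest even integer."""
    scale = 2.0 ** fraction_bits
    return round(x * scale) / scale


print(scalef(3.0, 2.7))        # 12.0
print(rndscale(3.14159, 2))    # 3.25
print(rndscale(2.5, 0))        # 2.0
```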

New instructions in AVX-512 conflict detection

The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.^[3]

Instruction	Name	Description
`VPCONFLICTD`, `VPCONFLICTQ`	Detect conflicts within a vector of packed doubleword or quadword values.	Compares each element of the source with all elements at earlier positions in the same vector and forms a bit vector of the results.
`VPLZCNTD`, `VPLZCNTQ`	Count the number of leading zero bits for packed doubleword or quadword values.	Vectorized `LZCNT` instruction.
`VPBROADCASTMB2Q`, `VPBROADCASTMW2D`	Broadcast mask to vector register.	Either 8-bit mask to quadword vector, or 16-bit mask to doubleword vector.
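Conflict detection and leading-zero counting can be sketched in scalar form. The function names are hypothetical. The sketch follows Intel's description of the conflict instruction, in which each element is compared with the elements at lower positions of the same vector and bit j of result element i is set when elements i and j are equal.

```python
def conflict(src):
    """For each element, build a bit vector marking which earlier
    elements hold the same value."""
    out = []
    for i, value in enumerate(src):
        bits = 0
        for j in range(i):
            if src[j] == value:
                bits |= 1 << j
        out.append(bits)
    return out


def lzcnt(value: int, width: int = 32) -> int:
    """Number of leading zero bits of an unsigned value of `width` bits."""
    return width - value.bit_length()


indices = [5, 7, 5, 5, 7]
print([bin(bits) for bits in conflict(indices)])
# ['0b0', '0b0', '0b1', '0b101', '0b10']
print(lzcnt(1), lzcnt(0), lzcnt(0x80000000))   # 31 32 0
```

An element with a zero result has no earlier duplicate, so in a loop that scatters through `indices` the elements with zero results form a subset that can be processed together without write conflicts.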

New instructions in AVX-512 exponential and reciprocal

AVX-512 exponential and reciprocal instructions contain more accurate approximate reciprocal instructions than those in the AVX-512 foundation; relative error is at most 2^-28. They also contain two new exponential functions that have a relative error of at most 2^-23.^[2]

Instruction	Description
`VEXP2PD`, `VEXP2PS`	Compute approximate exponential 2^x of packed single or double-precision floating point values
`VRCP28PD`, `VRCP28PS`	Compute approximate reciprocals of packed single or double-precision floating point values
`VRCP28SD`, `VRCP28SS`	Compute approximate reciprocal of scalar single or double-precision floating point value
`VRSQRT28PD`, `VRSQRT28PS`	Compute approximate reciprocals of square roots of packed single or double-precision floating point values
`VRSQRT28SD`, `VRSQRT28SS`	Compute approximate reciprocal of square root of scalar single or double-precision floating point value

New instructions in AVX-512 prefetch

AVX-512 prefetch instructions contain new prefetch operations for the scatter and gather functionality introduced in AVX2 and AVX-512. T0 prefetch means prefetching into the level 1 cache and T1 means prefetching into the level 2 cache. PREFETCHWT1 has a separate CPUID feature bit and may be available without the remaining instructions.^[citation needed]

Instruction	Description
`VGATHERPF0DPS`, `VGATHERPF0QPS`, `VGATHERPF0DPD`, `VGATHERPF0QPD`	Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T0 hint.
`VGATHERPF1DPS`, `VGATHERPF1QPS`, `VGATHERPF1DPD`, `VGATHERPF1QPD`	Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T1 hint.
`VSCATTERPF0DPS`, `VSCATTERPF0QPS`, `VSCATTERPF0DPD`, `VSCATTERPF0QPD`	Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T0 hint with intent to write.
`VSCATTERPF1DPS`, `VSCATTERPF1QPS`, `VSCATTERPF1DPD`, `VSCATTERPF1QPD`	Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double precision data using writemask k1 and T1 hint with intent to write.
`PREFETCHWT1`	Prefetch vector data into caches with intent to write and T1 hint

CPUs with AVX-512

Intel
- Xeon Phi Knights Landing: AVX3.1 (AVX-512F foundation plus AVX-512 CDI, AVX-512 PFI and AVX-512 ERI),^[1] in 2015
- Future Intel Processors to be named later^[1]
- speculation: Skylake: AVX3.2 (AVX-512F foundation plus TBA) processor, in 2015^[4]
- speculation: Cannonlake (AVX-512 foundation plus TBA) processor, in 2016

References

[1] James Reinders (23 July 2013). "AVX-512 Instructions". Intel. Retrieved 20 August 2013.
[2] "Intel Architecture Instruction Set Extensions Programming Reference" (PDF). Intel. Retrieved 29 January 2014.
[3] "AVX-512 Architecture/Demikhovsky Poster" (PDF). Intel. Retrieved 25 February 2014.
[4] http://vr-zone.com/articles/new-details-intels-skylake-revealed/76848.html
