Q (number format): Difference between revisions

Revision as of 02:25, 6 July 2021

The Q notation is a succinct way to specify the parameters of a binary fixed-point number format. It was introduced by Texas Instruments (TI) and has been used, with variations, by other software and electronics companies.^{[citation needed]} A number of other notations have been used for the same purpose.

Definition

The Texas Instruments version

The Q notation, as defined by Texas Instruments, consists of the letter Q followed by a pair of numbers m.n, where m is the number of bits used for the integer part of the value, and n is the number of fraction bits.

By default, the notation describes a signed binary fixed-point format, with the unscaled integer stored in two's complement format, as used in most binary processors. The first bit always gives the sign of the value (1 = negative, 0 = non-negative), and it is not counted in the m parameter. Thus the total number w of bits used is 1 + m + n.

For example, the specification Q3.12 describes a signed binary fixed-point number with w = 16 bits in total, comprising the sign bit, three bits for the integer part, and 12 bits that are assumed to be fraction. That is, it is a 16-bit signed (two's complement) integer that is implicitly multiplied by the scaling factor 2⁻¹².
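The Q3.12 example can be sketched in C as a pair of conversion helpers. This is only an illustration under stated assumptions: the names q3_12_from_double and q3_12_to_double and the two macros are hypothetical, and rounding half away from zero is one possible choice that the notation itself does not prescribe.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for the TI-style Q3.12 format:
   1 sign bit + 3 integer bits + 12 fraction bits = 16 bits. */
#define Q3_12_FRAC_BITS 12
#define Q3_12_SCALE     (1 << Q3_12_FRAC_BITS)   /* 2^12 = 4096 */

/* Encode a real value as the stored integer, rounding half away from zero. */
int16_t q3_12_from_double(double x)
{
    /* Q3.12 covers -8.0 (inclusive) to +8.0 (exclusive). */
    assert(x >= -8.0 && x < 8.0);
    double scaled = x * Q3_12_SCALE + (x >= 0 ? 0.5 : -0.5);
    int32_t raw = (int32_t)scaled;   /* the cast truncates toward zero */
    if (raw > INT16_MAX)             /* values just below +8.0 can round up out of range */
        raw = INT16_MAX;
    return (int16_t)raw;
}

/* Decode the stored integer by applying the implicit 2^-12 scaling factor. */
double q3_12_to_double(int16_t raw)
{
    return (double)raw / Q3_12_SCALE;
}
```

With these helpers, the value 1.5 is stored as the integer 6144, and the smallest positive step is 1/4096.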

In particular, when n is zero, the numbers are just integers. If m is zero, all bits except the sign bit are fraction bits; then the range of the stored number is from −1.0 (inclusive) to +1.0 (exclusive). Both m and n may be negative.

The m and the dot may be omitted, in which case they are inferred from the size of the variable or register where the value is stored. Thus Q12 means a signed integer with any number of bits, that is implicitly multiplied by 2⁻¹².

The letter U can be prefixed to the Q to denote an unsigned binary fixed-point format. For example, UQ1.15 describes values represented as unsigned 16-bit integers with an implicit scaling factor of 2⁻¹⁵, which range from 0.0 to (2¹⁶ − 1)/2¹⁵ = +1.999969482421875.

The AMD version

A variant of the Q notation has been in use by AMD. In this variant, the m number includes the sign bit. For example, a 16-bit signed integer would be denoted Q15.0 in the TI variant, but Q16.0 in the AMD variant.^[1]^[2]
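The difference between the two conventions comes down to where the sign bit is counted. A minimal sketch, with hypothetical function names, for signed formats only:

```c
/* Total number of bits for a signed Qm.n format under each convention.
   All function names here are hypothetical. */
int q_total_bits_ti(int m, int n)
{
    return 1 + m + n;   /* TI: the sign bit is not counted in m */
}

int q_total_bits_amd(int m, int n)
{
    return m + n;       /* AMD: m already includes the sign bit */
}

/* Re-express a TI-style m as the AMD-style m for the same signed format. */
int q_m_ti_to_amd(int m_ti)
{
    return m_ti + 1;
}
```

Both q_total_bits_ti(15, 0) and q_total_bits_amd(16, 0) give 16, matching the 16-bit signed integer example.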

Characteristics

The resolution (difference between successive values) of a Qm.n or UQm.n format (AMD convention) is always 2⁻ⁿ. The range of representable values is

$-2^{m-1}$ to $+2^{m-1}-2^{-n}$ for the signed format, and
0 to $2^{m}-2^{-n}$ for the unsigned format.

For example, a Q15.1 format number requires 15 + 1 = 16 bits, has resolution 2⁻¹ = 0.5, and the representable values range from −2¹⁴ = −16384.0 to +2¹⁴ − 2⁻¹ = +16383.5. In hexadecimal, the negative values range from 0x8000 to 0xFFFF, followed by the non-negative ones from 0x0000 to 0x7FFF.
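The range and resolution can be computed directly from m and n. The sketch below assumes the AMD convention (m + n bits in total); the struct and function names are hypothetical. For the unsigned case it uses an upper bound of 2^m − 2^−n, which is what an (m + n)-bit unsigned container yields and which agrees with the UQ1.15 example given earlier.

```c
#include <assert.h>

/* Hypothetical descriptor for the value range of a Qm.n or UQm.n format. */
struct q_range {
    double min;
    double max;
    double resolution;
};

/* 2^e for small integer exponents, positive or negative. */
double two_to_the(int e)
{
    double r = 1.0;
    for (; e > 0; --e) r *= 2.0;
    for (; e < 0; ++e) r /= 2.0;
    return r;
}

/* Signed Qm.n, AMD convention: m includes the sign bit. */
struct q_range q_range_signed(int m, int n)
{
    assert(m + n > 0);
    struct q_range r;
    r.resolution = two_to_the(-n);
    r.min = -two_to_the(m - 1);
    r.max = two_to_the(m - 1) - r.resolution;
    return r;
}

/* Unsigned UQm.n stored in an (m + n)-bit unsigned container. */
struct q_range q_range_unsigned(int m, int n)
{
    assert(m + n > 0);
    struct q_range r;
    r.resolution = two_to_the(-n);
    r.min = 0.0;
    r.max = two_to_the(m) - r.resolution;
    return r;
}
```

Calling q_range_signed(15, 1) reproduces the Q15.1 figures: a minimum of −16384.0, a maximum of +16383.5 and a resolution of 0.5.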

Math operations

Q numbers are a ratio of two integers: the numerator is kept in storage, the denominator is equal to 2ⁿ.

Consider the following example:

The Q8 denominator equals 2⁸ = 256
1.5 equals 384/256
384 is stored, 256 is inferred because it is a Q8 number.

If the Q number's base is to be maintained (n remains constant), the Q number math operations must keep the denominator constant. The following formulas show math operations on general Q numbers with stored numerators $N_{1}$ and $N_{2}$ and the common denominator $d=2^{n}$.

${\begin{aligned}{\frac {N_{1}}{d}}+{\frac {N_{2}}{d}}&={\frac {N_{1}+N_{2}}{d}}\\{\frac {N_{1}}{d}}-{\frac {N_{2}}{d}}&={\frac {N_{1}-N_{2}}{d}}\\\left({\frac {N_{1}}{d}}\times {\frac {N_{2}}{d}}\right)\times d&={\frac {N_{1}\times N_{2}}{d}}\\\left({\frac {N_{1}}{d}}/{\frac {N_{2}}{d}}\right)/d&={\frac {N_{1}/N_{2}}{d}}\end{aligned}}$

Because the denominator is a power of two, the multiplication by d can be implemented as an arithmetic shift to the left and the division by d as an arithmetic shift to the right; on many processors, shifts are faster than multiplication and division.

To maintain accuracy, the intermediate multiplication and division results must be held at double width (for example, 32 bits for 16-bit operands), and care must be taken in rounding the intermediate result before converting back to the desired Q number.
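As a worked instance of the multiplication rule, take the Q8 value 1.5 (stored as 384) from the example above and a second operand 2.25 (stored as 576). The second operand is chosen here purely for illustration and does not come from the source.

${\begin{aligned}N_{1}\times N_{2}&=384\times 576=221184\\{\frac {N_{1}\times N_{2}}{d}}&={\frac {221184}{256}}=864\\{\frac {864}{256}}&=3.375=1.5\times 2.25\end{aligned}}$

The intermediate product 221184 does not fit in a signed 16-bit integer, which illustrates why the wider intermediate result is needed.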

Using C, the operations are as follows (note that here Q refers to the number of bits in the fractional part):

Addition

#include <stdint.h>

// plain addition: a sum outside the int16_t range is not saturated
int16_t q_add(int16_t a, int16_t b)
{
    return a + b;
}

With saturation

int16_t q_add_sat(int16_t a, int16_t b)
{
    int16_t result;
    int32_t tmp;

    // widen to 32 bits so that the sum itself cannot overflow
    tmp = (int32_t)a + (int32_t)b;
    if (tmp > 0x7FFF)
        tmp = 0x7FFF;
    if (tmp < -0x8000)
        tmp = -0x8000;
    result = (int16_t)tmp;

    return result;
}

Unlike floating-point ±Inf, saturated results are not sticky: in the implementation shown, adding a negative value to a positive saturated value (0x7FFF) unsaturates it, and vice versa. In assembly language, the signed overflow flag can be used to avoid the typecasts needed for the C implementation.

Subtraction

int16_t q_sub(int16_t a, int16_t b)
{
    return a - b;
}
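
A saturating counterpart for subtraction can be written along the same lines as q_add_sat. The function name q_sub_sat is hypothetical and does not appear in the source; the sketch simply widens to 32 bits and clamps.

```c
#include <stdint.h>

/* Hypothetical saturating subtraction, mirroring the structure of q_add_sat. */
int16_t q_sub_sat(int16_t a, int16_t b)
{
    /* Widen to 32 bits so that the difference itself cannot overflow. */
    int32_t tmp = (int32_t)a - (int32_t)b;

    if (tmp > INT16_MAX)
        tmp = INT16_MAX;
    if (tmp < INT16_MIN)
        tmp = INT16_MIN;

    return (int16_t)tmp;
}
```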

Multiplication

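// number of fraction bits; 8 is an assumed, illustrative value
#define Q   8
// precomputed value: 2^(Q-1), half of the divisor 2^Q, used for rounding
#define K   (1 << (Q - 1))

// saturate to range of int16_t
int16_t sat16(int32_t x)
{
    if (x > 0x7FFF) return 0x7FFF;
    else if (x < -0x8000) return -0x8000;
    else return (int16_t)x;
}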

int16_t q_mul(int16_t a, int16_t b)
{
    int16_t result;
    int32_t temp;

    temp = (int32_t)a * (int32_t)b; // result type is operand's type
    // Rounding; mid values are rounded up
    temp += K;
    // Correct by dividing by base and saturate result
    result = sat16(temp >> Q);

    return result;
}

Division

int16_t q_div(int16_t a, int16_t b)
{
    /* pre-multiply by the base 2^Q so that the quotient keeps Q fraction bits;
       multiplying avoids left-shifting a negative value, which is undefined in C */
    int32_t temp = (int32_t)a * (1 << Q);
    /* Rounding: mid values are rounded up (down for negative values). */
    /* OR compare most significant bits i.e. if (((temp >> 31) & 1) == ((b >> 15) & 1)) */
    if ((temp >= 0 && b >= 0) || (temp < 0 && b < 0)) {
        temp += b / 2;    /* OR shift 1 bit i.e. temp += (b >> 1); */
    } else {
        temp -= b / 2;    /* OR shift 1 bit i.e. temp -= (b >> 1); */
    }
    /* the caller must ensure b != 0; the quotient is not saturated */
    return (int16_t)(temp / b);
}
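
A division routine that guards against a zero divisor and saturates the quotient might look like the sketch below. The name q_div_sat, the choice of Q = 8, and the policy of saturating on division by zero are all assumptions made for illustration; they are not taken from the source.

```c
#include <stdint.h>

#define Q 8   /* assumed number of fraction bits */

/* Hypothetical saturating division for Q-format values held in int16_t. */
int16_t q_div_sat(int16_t a, int16_t b)
{
    /* No quotient exists for b == 0: saturate according to the sign of a. */
    if (b == 0)
        return (a >= 0) ? INT16_MAX : INT16_MIN;

    /* Pre-multiply by 2^Q; multiplication avoids left-shifting a negative value. */
    int32_t temp = (int32_t)a * (1 << Q);

    /* Bias by half the divisor so that truncating division rounds to nearest. */
    if ((temp >= 0) == (b >= 0))
        temp += b / 2;
    else
        temp -= b / 2;

    int32_t quotient = temp / b;
    if (quotient > INT16_MAX)
        return INT16_MAX;
    if (quotient < INT16_MIN)
        return INT16_MIN;
    return (int16_t)quotient;
}
```

With Q = 8, dividing 3.375 (stored as 864) by 2.25 (stored as 576) gives 384, the stored form of 1.5.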

References

[1] "ARM Developer Suite AXD and armsd Debuggers Guide". 1.2. ARM Limited. 2001 [1999]. Chapter 4.7.9. AXD > AXD Facilities > Data formatting > Q-format. ARM DUI 0066D. Archived from the original on 2017-11-04.
[2] "Chapter 4.7.9. AXD > AXD Facilities > Data formatting > Q-format". RealView Development Suite AXD and armsd Debuggers Guide (PDF). 3.0. ARM Limited. 2006 [1999]. pp. 4–24. ARM DUI 0066G. Archived (PDF) from the original on 2017-11-04.

External links

"Q-Number-Format Java Implementation". Archived from the original on 2017-11-04. Retrieved 2017-11-04.
"Q-format Converter". Archived from the original on 2021-06-25. Retrieved 2021-06-25.

@@ Line 7: / Line 7: @@
 ===The Texas Instruments version===
-The Q notation, as defined by Texas Instruments, consists of the letter <code>Q</code> followed by a pair of numbers ''m''<code>.</code>''f'', where ''m'' is the number of bits used for the integer part of the value, and ''f'' is the number of fraction bits. Thus, for example, the specification <code>Q3.12</code> describes a signed binary fixed-point number with 15 bits, not counting the sign bit, of which 12 are implied fraction. That is, a 16-bit signed integer in [[two's complement]] representation, that is implicitly multiplied by the scaling factor 2<sup>−12</sup>
+The Q notation, as defined by Texas Instruments, consists of the letter <code>Q</code> followed by a pair of numbers ''m''<code>.</code>''n'', where ''m'' is the number of bits used for the integer part of the value, and ''n'' is the number of fraction bits.
+By default, the notation describes ''signed'' binary fixed point format, with the unscaled integer being stored  in [[two's complement]] format, used in most binary processors. The first bit always gives the sign of the value(1 = negative, 0 = non-negative), and it is ''not'' counted in the ''m'' parameter.  Thus the total  number ''w'' of bits used is 1 + ''m'' + ''n''.
-The ''m'' and the dot may be omitted, in which case they are inferred from the size of the variable or register where the value is stored. Thus <code>Q12</code> means a signed integer with any number of bits, that is implicitly multiplied by 2<sup>−12</sup>.
+For example, the specification <code>Q3.12</code> describes a signed binary fixed-point number with a ''w'' = 16 bits in total, comprising the sign bit, three bits for the integer part, and 12 bits that are assumed to be fraction. That is, a 16-bit signed (two's complement) integer, that is implicitly multiplied by the scaling factor 2<sup>−12</sup>
-===The AMD version===
+In particular, when ''n'' is zero, the the numbers are just integers{{snd}}.  If ''m'' is zero, all bits except the sign bit are fraction bits; then the range of the stored number is from −1.0 (inclusive) to +1 (exclusive).  Both ''m'' and ''n'' may be negative
-A variant of the Q notation has been in use by [[AMD]].  In this variant, the ''m'' number includes the sign bit. For example, a 16-bit signed integer would be denoted <code>Q15.0</code> in the TI variant, but  <code>Q16.0</code> in the AMD variant.
+The ''m'' and the dot may be omitted, in which case they are inferred from the size of the variable or register where the value is stored. Thus <code>Q12</code> means a signed integer with any number of bits, that is implicitly multiplied by 2<sup>−12</sup>.
-==Characteristics==
-Some DSP architectures offer native support for common formats, such as Q1.15.  In this case, the processor can support arithmetic in one step, offering [[Saturation arithmetic|saturation]] (for addition and subtraction) and renormalization (for multiplication) in a single instruction.  Most standard CPUs do not.  If the architecture does not directly support the particular fixed point format chosen, the programmer will need to handle saturation and renormalization explicitly with bounds checking and bit shifting.
+The letter <code>U</code> can be prefixed to the <code>Q</code> to denote an ''unsigned'' binary fixed-point format.  For example, <code>UQ1.15</code> describes values represented as unsigned 16-bit integers with implicit scaling factor of 2<sup>−15</sup>, which range from 0.0 to (2<sup>16</sup>-1)/2<sup>15</sup> = +1.999969482421875.
-If n = 0, the Q numbers are integers{{snd}}.<ref name="ARM_2001"/><ref name="ARM_2006"/>
+===The AMD version===
-In addition, the letter U can be prefixed to the Q to indicate an unsigned value, such as UQ1.15, indicating values from 0.0 to +1.999969482421875 (that is, <math>1 + \frac{2^{15}-1}{2^{15}}</math>).
+A variant of the Q notation has been in use by [[AMD]].  In this variant, the ''m'' number includes the sign bit. For example, a 16-bit signed integer would be denoted <code>Q15.0</code> in the TI variant, but  <code>Q16.0</code> in the AMD variant.<ref name="ARM_2001"/><ref name="ARM_2006"/>
-Signed Q values are stored in [[two's complement]] format, just like signed integer values on most processors.  In two's complement, the sign bit is extended to the register size.
+==Characteristics==
-For a given Q''m''.''n'' format, using an ''m''+''n'' bit signed integer container with ''n'' fractional bits:
+The resolution (difference between successive values) of a Q''m''.''n'' or UQ''m''.''n'' format (AMD convention) is always 2<sup>−''n''</sup>.  The range of representable values is
-* its range is <math>[ - (2^{m-1}) , 2^{m-1} -2^{-n}]</math>
-* its resolution is <math>2^{-n}</math>
-For a given UQ''m''.''n'' format, using an ''m''+''n'' bit unsigned integer container with ''n'' fractional bits:
-* its range is <math>[ 0 , 2^m -2^{-n}]</math>
-* its resolution is <math>2^{-n}</math>
-For example, a Q15.1 format number:
+* −2<sup>''m''−1</sup> to +2<sup>''m''−1</sup> − 2<sup>−''n''</sup> for signed format, and
-* requires 15+1 = 16 bits
+* 0 to 2<sup>''m''−1</sup> − 2<sup>−''n''</sup> for the unsigned format.
-* its range is [-2<sup>14</sup>, 2<sup>14</sup> - 2<sup>−1</sup>] = [-16384.0, +16383.5] = [0x8000,  0x8001 … 0xFFFF, 0x0000, 0x0001 … 0x7FFE, 0x7FFF]
-* its resolution is 2<sup>−1</sup> = 0.5
+For example, a Q15.1 format number requires 15+1 = 16 bits, has resolution 2<sup>−1</sup> = 0.5, and the representable values range from -2<sup>14</sup> = -16384.0 to +2<sup>14</sup> - 2<sup>−1</sup> = +16383.5. In hexadecimal, the negative values range from 0x8000 to 0xFFFF followed by the non-negative ones from 0x0000 to 0x7FFF.
-Unlike [[floating point]] numbers, the resolution of Q numbers will remain constant over the entire range.
 ==Math operations==