Jump to content

Fixed-point arithmetic

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Jorge Stolfi (talk | contribs) at 05:26, 5 July 2021 (Precision loss and overflow: Removed now-redundant section.). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computing, fixed-point refers to a method of representing fractional (non-integer) numbers by storing a fixed number of digits of their fractional part. Dollar amounts, for example, are often stored with exactly two fractional digits, representing the cents (1/100 of dollar). Fixed-point number representation is often contrasted to the more complicated and computationally demanding floating-point representation.

In the fixed-point representation, the fraction is often expressed in the same number base as the integer part, but using negative powers of the base b. The most common variants of are decimal (base 10) and binary (base 2). Thus, if n fraction digits are stored, the value will always be an integer multiple of bn. Fixed-point representation can also be used to omit the low-order digits of integer values, e.g. when representing dollar values as multiples of $1000.

When decimal fixed-point numbers are displayed for human reading, the fraction digits are usually separated from those of the integer part by a radix character (usually '.' in English, but ',' or some other symbol in many other languages). Internally, however, there is no separation, and the distinction between the two groups of digits is defined only by the programs that handle such numbers.

Fixed-point representation was the norm in mechanical calculators. Since most modern processors have fast floating point unit (FPU), fixed-point representations are now used only in special situations, such as in low-cost embedded microprocessors and microcontrollers, or when their use is more natural for the problem. Examples of the latter are accounting of dollar amounts, when fractions of cents must be rounded to whole cents in strictly prescribed ways; and the evaluation of functions by table lookup.

Representation

Fixed-point representation with scaling 1/100
Value
represented
Internal
representation
0.00 0
0.5 50
0.99 99
2 200
−14.1 −1410
314.160 31416

A fixed-point representation of a fractional number is essentially an integer that is to be implicitly multiplied by a fixed scaling factor. For example, the value 1.23 can be stored in a variable as the integer value 1230 with implicit scaling factor of 1/1000, and the value 1,230,000 can be represented as 1230 with an implicit scaling factor of 1000.

A program will usually assume that all fixed-point values that will be stored into a given variable, or will be produced by a given instruction, will have the same scaling factor. The scaling factor of a variable or formula may not appear explicitly in the program. Good programming practice then requires that it be provided in the documentation, at least as a comment in the source code.

Choice of scaling factors

Even with the most careful rounding, fixed-point values represented with scaling factor S may have an error of up to ±0.5 in the stored integer, that is, ±0.5S in the value. Therefore, smaller scaling factors generally produce more accurate results.

On the other hand, a smaller scaling factor means a smaller range of the values that can be stored in a given program variable. The maximum fixed-point value that can be stored into a variable is the largest iinteger value that can be stored into it, multiplied by the scaling factor; and similarly for the minimum value.

For greater efficiency, scaling factors are often chosen to be powers (positive or negative) of the base b used to represent the integers internally. However, often the best scaling factor is dictated by the application. Thus one often uses scaling factors that are powers of 10 (e.g. 1/100 for dollar values), for human convenience, even when the integers are represented internally in binary. Decimal scaling factors also mesh well with the metric (SI) system, since the choice of the fixed-point scaling factor is often equivalent to the choice of a unit of measure (like centimetres or micronss instead of metres).

However, other scaling factors may be used occasionally, e.g. a time value in hours may be represented as a fixed-point type with a scale factor of 1/3600 to obtain values with one-second accuracy.

Any binary fraction a/2m, such as 1/16 or 17/32, can be exactly represented in fixed-point, with a power-of-two scaling factor 1/2n with any nm. However, most decimal fractions like 0.1 or 0.123 are infinite repeating fractions in base 2. and hence cannot be represented that way.

Similarly, any decimal fraction a/10m, such as 1/100 or 37/1000, can be exactly represented in fixed point with a power-of-ten scaling factor 1/10n with any nm. This decimal format can also represent any binary fraction a/2m, such as 1/8 (0.125) or 17/32 (0.53125).

Comparison with floating-point

Fixed-point computations can be faster and/or use less hardware than floating-point ones. If the range of the values to be represented is known in advance and is sufficiently limited, fixed point can make better use of the available bits. For example, if 32 bits are available to represent a number between 0 and 1, a fixed-point representation can have error less than 1.2 × 10−10, whereas the standard floating-point representation may have error up to 596 × 10−10 — because 9 of the bits are wasted with the sign and exponent of the dynamic scaling factor.

Programs using fixed-point computations are usually more portable than those using floating-point, since they don't depend on the availabilty of an FPU. This advantage was particularly strong before the IEEE Floating Point Standard was widely adopted, when floating-point computations with the same data would yield different results depending on the manufacturer, and often on the computer model.

In many cases, the rounding and truncation errors of fixed-point computations are easier to analyze than those of the equivalent floating-point computations. On the other hand, the use of fixed point requires greater care by the programmer. Avoidance of overflow requires much tighter estimates for the ranges of variables and all intermediate values in the computation, and often also extra code to adjust the scaling factors.

Operations

Addition and subtraction

To add or subtract two values with the same implicit scaling factor, it is sufficient to add or subtract the underlying integers; the result will have their common implicit scaling factor, can thus can be stored in the same program variables as the operands. These operations yield the exact mathematical result, as long as no overflow occurs—that is, as long as the resulting integer can be stored in the receiving program variable. If the values have different scaling factors, then they must be converted to a common scaling factor before the operation.

Multiplication

To multiply two fixed-point numbers, it suffices to multiply the two underlying integers, and assume that the scaling factor of the result is the product of their scaling factors. The result will be exact, with no rounding, provided that it does not overflow the receiving variable.

For example, multiplying the numbers 123 scaled by 1/1000 (0.123) and 25 scaled by 1/10 (2.5) yields the integer 123×25 = 3075 scaled by (1/1000)×(1/10) = 1/10000, that is 3075/10000 = 0.3075. As another example, multiplying the first number by 155 implicitly scaled by 1/32 (155/32 = 4.84375) yields the integer 123×155 = 19065 with implicit scaling factor (1/100)×(1/32) = 1/32000, that is 19065/32000 = 0.59578125.

Division

To divide two fixed-point numbers, one takes the integer quotient of their underlying integers, and assumes that the scaling factor is the quotient of their scaling factors. In general, the first division requires rounding and therefore the result is not exact.

For example, division of 3456 scaled by 1/100 (34.56) and 1234 scaled by 1/1000 (1.234) yields the integer 3456÷1234 = 3 (rounded) with scale factor (1/100)/(1/1000) = 10, that is, 30. As another example, the division of the first number by 155 implicitly scaled by 1/32 (155/32 = 4.84375) yields the integer 3456÷155 = 22 (rounded) with implicit scaling factor (1/100)/(1/32) = 32/100 = 8/25, that is 22×32/100 = 7.04.

If the result is not exact, the error introduced by the rounding can be reduced or even eliminated by converting the dividend to a smaller scaling factor. For example, if r = 1.23 is represented as 123 with scaling 1/100, and s = 6.25 is represented as 6250 with scaling 1/1000, then simple division of the integers yields 123÷6250 = 0 (rounded) with scaling factor (1/100)/(1/1000) = 10. If r is first converted to 1,230,000 with scaling factor 1/1000000, the result will be 1,230,000÷625 = 197 (rounded) with scale factor 1/1000 (0.197). The exact value 1.23/6.25 is 0.1968.

Scaling conversion

In fixed-point computing it is often necessary to convert a value to a different scaling factor. This operation is necessary, for example:

  • To store a value into a program variable that has a different implicit scaling factor;
  • To convert two values to the same scaling factor, so that they can be added or subtracted;
  • To restore the original scaling factor of a value after multiplying or dividing it by another;
  • To improve the accuracy of the result of a division;
  • To ensure that the scaling factor of a product or quotient is a simple power like 10n 2n;
  • To ensure that the result of an operation can be stored into a program variable without overflow;
  • To reduce the cost of hardware that processes fixed-point data.

To convert a number from a fixed point type with scaling factor R to another type with scaling factor S, the underlying integer must be multiplied by the ratio R/S. Thus, for example, to convert the value 1.23 = 123/100 from scaling factor R=1/100 to one with scaling factor S=1/1000, the integer 123 must be multiplied by (1/100)/(1/1000) = 10, yielding the representation 1230/1000.

If S does not divide R (in particular, if the new scaling factor S is greater than the original R), the new integer may have to be rounded.

In particular, if r and s are fixed-point variables with implicit scaling factors R and S, the operation rr×s require multiplying the respective integer representations and explicitly dividing the result by S. The result may have to be rounded, and overflow may occur.

For example, if the common scaling factor is 1/100, multiplying 1.23 by 0.25 entails multiplying 123 by 25 to yield 3075 with an intermediate scaling factor of 1/10000. In order to return to the original scaling factor 1/100, the integer 3075 then must be multiplied by 1/100, that is, divided by 100, to yield either 31 (0.31) or 30 (0.30), depending on the rounding policy used.

Similarly, the operation rr/s will require dividing the integers and explicitly multiplying the quotient by S. Rounding and/or overflow may occur here too.

If the scaling factor is a power of the base used internall to represent the integer, changing the scaling factor requires only dropping low-order digits of the integer, or appending zero digits.

Applications

A common use of decimal fixed-point is for storing monetary values, where the inexact values of binary floating-point numbers are often a liability.

Binary fixed-point was widely used in the 1970s and 1980s for real-time computing that was mathematically intensive, such as flight simulation and in nuclear power plant control algorithms since the late 1960s. It is still used in many DSP applications and custom made microprocessors. Computations involving angles would use binary angular measurement (BAM).

Binary fixed point is used in the STM32G4 series built in CORDIC co-processors and in the discrete cosine transform (DCT) algorithms used to compress JPEG images.

Hardware support

Typical processors do not have specific support for fixed-point arithmetic. However, most computers with binary arithmetic have fast bit shift instructions that can multiply or divide an integer by any power of 2. These instructions can be used to quickly change scaling factors that are powers of 2, while preserving the sign of the number.

Early computers like the IBM 1620 and the Burroughs B3500 used a binary-coded decimal (BCD) representaton for integers, namely base 10 where each decimal digit was independently encoded with 4 bits. Some processors, such as microcontrollers, may still use it. In such machines, conversion of decimal scaling factors can be performed by bit shifts and/or by memory address manipulation.

Some processors can set a hardware overflow flag and/or generate an exception on the occurrence of an overflow.

Computer language support

Explicit support for fixed-point representation and computations was provided by a few computer languages, notably PL/I, COBOL, Ada, JOVIAL, and Coral 66. They provided fixed-point numerical data types, binary and/or decimal, that could be assigned to variables and functions. The compiler then would then automatically generate code to do the appropriate scaling conversions when translating statements like rs × t + u, when reading or writing the values of those variables, or when converting them to other data types such as floating-point.

Most of those languages were designed between 1940 and 1990. More modern languages usually do not offer any fixed-point data types or support for scaling factor conversion. That is also the case for several older languages that are still very popular, like FORTRAN, C and C++. The wide availability of fast floating-point processors, with strictly standardized behavior, have greatly reduced the demand for binary fixed point support. Similarly, the support for decimal floating point in some programming languages, like C# and Python, has removed most of the need for decimal fixed-point support. In the few situations that call for fixed-point operations, they can be implemented by the programmer, with explicit scaling conversion, in any programming language.

On the other hand, all relational databases and the SQL notation support fixed-point decimal arithmetic and storage of numbers. PostgreSQL has a special numeric type for exact storage of numbers with up to 1000 digits.[1]

Moreover, in 2008 the International Standards Organization (ISO) issued a proposal to extend the C programming language with fixed-point data types, for the benefit of programs running on embedded processors.[2] Also, the GNU Compiler Collection (GCC) has back-end support for fixed-point.[3][4]

A worked Example

Suppose there is the following multiplication with 2 fixed point 3 decimal place numbers.

(10.500)(1.050) =1*10.500 + 0.050*10.500 = 10.500+0.525000=11.025000

Note how since there are 3 decimal places we show the trailing zeros. To re-characterize this as an integer multiplication we must first multiply by 1000 (10^3) moving all the decimal places in to integer places, then we will multiply by (10^-3) to put them back the equation now looks like

(10.500)(10^(3))(10^(-3)) (1.050)(10^(3))(10^(-3))  (10^(-3))(10^(-3))
= (10500)(1050) (10^-6)
= 11 025 000
= 11.025000

This works equivalently if we choose a different base, notably base 2 for computing, since a bit shift is the same as a multiplication or division by an order of 2. Three decimal digits is equivalent to about 10 binary digits, so we should round 0.05 to 10 bits after the binary point. The closest approximation is then 0.0000110011.

10= 8+2=2^3+2^1
1=2^0
0.5= 2^-1
0.05= 0.0000110011_2

Thus our multiplication becomes

(1010.100)(2^3)(1.0000110011)(2^10) (2^-13)
=(1010100)(10000110011) (2^-13)
=(10110000010111100) (2^-13)
=1011.0000010111100

This rounds to 11.023 with three digits after the decimal point.

Notation

There are various notations used to represent word length and radix point in a binary fixed-point number. In the following list, f represents the number of fractional bits, m the number of magnitude or integer bits, s the number of sign bits, and b the total number of bits.

  • Qf: The "Q" prefix. For example, Q15 represents a number with 15 fractional bits. This notation is ambiguous since it does not specify the word length, however it is usually assumed that the word length is either 16 or 32 bits depending on the target processor in use.[5]
  • Qm.f: The unambiguous form of the "Q" notation. Since the entire word is a 2's complement integer, a sign bit is implied. For example, Q1.30 describes a number with 1 integer bit and 30 fractional bits stored as a 32-bit 2's complement integer.[5][6]
  • fxm.b: The "fx" prefix is similar to the above, but uses the word length as the second item in the dotted pair. For example, fx1.16 describes a number with 1 magnitude bit and 15 fractional bits in a 16 bit word.[7]
  • s:m:f: Yet other notations include a sign bit, such as this one used in the PS2 GS User's Guide.[8] It also differs from conventional usage by using a colon instead of a period as the separator. For example, in this notation, 0:8:0 represents an unsigned 8-bit integer.
  • (p,q) Used in the PL/I programming language, to specify p total digits (not including sign) with q after the radix point. q can be positive or negative, and the radix binary or decimal.
  • <s,p,i> As used with the LabVIEW programming language, for 'FXP' fixed point numbers. Where s is either + or ±, signifying either an unsigned or 2's complement signed number respectively. This format indicates p total digits with i being the 'integer part'. It is notable that this format allows for arbitrary placement of the 'units' place - which need not occur within the given digits. That is; whilst the total bits p must be a natural integer, i may be larger, zero, or even negative - these situations merely correspond to different overall scaling factors for the number.

Binary scaling

Binary scaling is a computer programming technique used typically in embedded C, DSP and assembler programs to implement non-integer operations by using the native integer arithmetic of the processor.

Overview

A representation of a value using binary scaling is more precise than a floating-point representation occupying the same number of bits, but typically represents values of a more limited range, therefore more easily leading to arithmetic overflow during computation. Implementation of operations using integer arithmetic instructions is often (but not always) faster than the corresponding floating-point instructions.

A position for the 'binary point' is chosen for each variable to be represented, and binary shifts associated with arithmetic operations are adjusted accordingly. The binary scaling corresponds in Q (number format) to the first digit, i.e. Q1.15 is a 16 bit integer scaled with one bit as integer and fifteen as fractional. A Bscal 1 or Q1.15 number would represent approximately 0.999 to −1.0.

To give an example, a common way to use integer arithmetic to simulate floating point, using 32-bit numbers, is to multiply the coefficients by 65536.

Using binary scientific notation, this will place the binary point at B16. That is to say, the most significant 16 bits represent the integer part the remainder are represent the fractional part. This means, as a signed two's complement integer B16 number can hold a highest value of and a lowest value of −32768.0. Put another way, the B number, is the number of integer bits used to represent the number which defines its value range. Remaining low bits (i.e. the non-integer bits) are used to store fractional quantities and supply more accuracy.

For instance, to represent 1.2 and 5.6 as B16 one multiplies them by 216, giving 78643 and 367001 as the closest B16 representations.

Multiplying these together gives

28862059643

To convert it back to B16, divide it by 216 and round.

This gives 440400B16, which when converted back to a decimal number (by dividing again by 216) gives 6.71997 approximately. The correct result is 6.72.

Re-scaling after multiplication

The example above for a B16 multiplication is a simplified example. Re-scaling depends on both the B scale value and the word size. B16 is often used in 32 bit systems because it works simply by multiplying and dividing by 65536 (or shifting 16 bits).

Consider the Binary Point in a signed 32 bit word thus:

0 1 2 3 4 5 6 7 8 9
 S X X X X X X X   X X X X X X X X   X X X X X X X X   X X X X X X X X

where S is the sign bit and X are the other bits.

Placing the binary point at

  • 0 gives a range of −1.0 to 0.999999.
  • 1 gives a range of −2.0 to 1.999999
  • 2 gives a range of −4.0 to 3.999999 and so on.

When using different B scalings and/or word sizes the complete B scaling conversion formula must be used.

Consider a 32 bit word size, and two variables, one with a B scaling of 2 and the other with a scaling of 4.

1.4 @ B2 is 1.4 * (2 ^ (wordsize-2-1)) == 1.4 * 2 ^ 29 == 0x2CCCCCCD

Note that here the 1.4 values is very well represented with 30 fraction bits. A 32 bit floating-point number has 23 bits to store the fraction in. This is why B scaling is always more accurate than floating point of the same word size. This is especially useful in integrators or repeated summing of small quantities where rounding error can be a subtle but very dangerous problem when using floating point.

Now a larger number 15.2 at B4.

15.2 @ B4 is 15.2 * (2 ^ (wordsize-4-1)) == 15.2 * 2 ^ 27 == 0x7999999A

The number of bits to store the fraction is 28 bits. Multiplying these 32 bit numbers give the 64 bit result 0x1547AE14A51EB852

This result is in B7 in a 64 bit word. Shifting it down by 32 bits gives the result in B7 in 32 bits.

0x1547AE14

To convert back to a real number, divide this by (2^(wordsize-7-1)), resulting in 21.2800000099.

Various scalings may be used. B0 for instance can be used to represent any number between −1 and 0.999999999.

Software application examples

  • The Nest Labs Utilities library, provides a limited set of macros and functions for fixed point numbers, particularly when dealing with those numbers in the context of sensor sampling and sensor outputs.
  • GnuCash is an application for tracking money which is written in C. It switched from a floating-point representation of money to a fixed-point implementation as of version 1.6. This change was made to trade the less predictable rounding errors of floating-point representations for more control over rounding (for example, to the nearest cent).
  • Tremor, Toast and MAD are software libraries which decode the Ogg Vorbis, GSM Full Rate and MP3 audio formats respectively. These codecs use fixed-point arithmetic because many audio decoding hardware devices do not have an FPU (partly to save money, but primarily to save power - integer units are much smaller in silicon area than an FPU) and audio decoding requires performance to the extent a software implementation of floating-point on low-speed devices would not produce output in real time.
  • All 3D graphics engines on Sony's original PlayStation, Sega's Saturn, Nintendo's Game Boy Advance (only 2D), Nintendo DS (2D and 3D), Nintendo Gamecube[9] and GP2X Wiz video game systems use fixed-point arithmetic for the same reason as Tremor and Toast: to gain throughput on an architecture without an FPU (e.g. the PlayStation included hardware support for 4.12bit values in its transformation coprocessor).
  • The OpenGL ES 1.x specification includes a fixed point profile, as it is an API aimed for embedded systems, which do not always have an FPU.
  • TeX uses fixed point with 16 bits after the binary point, in units of points, for all position calculations. TeX font metric files use 32-bit signed fixed-point numbers, with 12 bits to the left of the decimal.
  • The dc and bc programs are arbitrary precision calculators, but only keep track of a (user-specified) fixed number of fractional digits.
  • VisSim A visually programmed block diagram language supporting a fixed-point block set to allow simulation and automatic code generation of fixed-point operations. Both word size and radix point can be specified on an operator basis.
  • Fractint represents numbers as Q2.29 fixed-point numbers,[10] to speed up drawing on old PCs with 386 or 486SX processors, which lacked an FPU.
  • Doom was the last first-person shooter title by id Software to use a 16.16 fixed point representation for all of its non-integer computations, including map system, geometry, rendering, player movement etc. This was done in order for the game to be playable on 386 and 486SX CPUs without an FPU. For compatibility reasons, this representation is still used in modern Doom source ports.
  • Wavpack is a fixed-point, lossless audio compressor. Its creator, David Bryant, justifies the decision to implement fixed instead of floating point: " I believe that integer operations are less susceptible to subtle chip to chip variations that could corrupt the lossless nature of the compression."[11]
  • fixed point numbers are sometimes used for storing and manipulating images and video frames. Processors with SIMD units aimed at image processing may include instructions suitable for handling packed fixed point data.
  • The Q# programming language for the Azure quantum computers, that implement quantum logic gates, contains a standard numeric library for performing fixed-point arithmetic on registers of qubits.[12]
  • The TrueType font format uses the F26Dot6 format (32-bit signed fixed-point with 26 bits to the left of the decimal) for some numeric values in its instructions.[13] This format was chosen to provide the minimal amount of precision required for hinting and for performance reasons.[14]
  • The FixedPointNumbers package for Julia implements both 2n and 2n-1 scaled fixed point numbers. The latter are used as a foundation for image processing to provide a consistent scale for both "integer" and floating-point intensity values, thus avoiding the implicit "255 == 1.0" (non-equation) present in many other image processing frameworks.

See also

References

  1. ^ PostgreSQL manual, section 8.1.2. Arbitrary Precision Numbers
  2. ^ JTC1/SC22/WG14 (2008), status of TR 18037: Embedded C
  3. ^ GCC wiki, Fixed-Point Arithmetic Support
  4. ^ Using GCC, section 5.13 Fixed-Point Types
  5. ^ a b Texas Instruments, TMS320C64x DSP Library Programmer's Reference, Appendix A.2
  6. ^ "MathWorks Fixed-Point Toolbox Documentation Glossary". mathworks.com.
  7. ^ Inc., solidThinking. "VisSim is now solidThinking Embed". www.vissim.com. {{cite web}}: |last= has generic name (help)
  8. ^ PS2 GS User's Guide, Chapter 7.1 "Explanatory Notes"
  9. ^ "Dolphin Emulator". Dolphin Emulator.
  10. ^ Fractint, A Little Code
  11. ^ "WavPack Technical Description". www.wavpack.com. Retrieved 2015-07-13.
  12. ^ "Introduction to the Quantum Numerics Library". Retrieved 2019-11-13.
  13. ^ "The TrueType Instruction Set: Data types".
  14. ^ "[Freetype] Why 26.6 ?".

Further reading