{{Short description|Array data structure that compactly stores bits}}
{{more citations needed|date=December 2010}}

A '''bit array''' (also known as '''bitmask''',<ref name="linux" /> '''bit map''', '''bit set''', '''bit string''', or '''bit vector''') is an [[array data structure]] that compactly stores [[bit]]s. It can be used to implement a simple [[set data structure]]. A bit array is effective at exploiting [[bit-level parallelism]] in hardware to perform operations quickly. A typical bit array stores ''kw'' bits, where ''w'' is the number of bits in the unit of storage, such as a [[byte]] or [[Word (computer architecture)|word]], and ''k'' is some nonnegative integer. If ''w'' does not divide the number of bits to be stored, some space is wasted due to [[Fragmentation (computing)|internal fragmentation]].

== Definition ==
A bit array is a mapping from some domain (almost always a range of integers) to values in the set {0, 1}. The values can be interpreted as dark/light, absent/present, locked/unlocked, valid/invalid, et cetera. The point is that there are only two possible values, so they can be stored in one bit. As with other arrays, the access to a single bit can be managed by applying an index to the array. Assuming its size (or length) to be ''n'' bits, the array can be used to specify a subset of the domain (e.g. {0, 1, 2, ..., ''n''−1}), where a 1-bit indicates the presence and a 0-bit the absence of a number in the set. This set data structure uses about ''n''/''w'' words of space, where ''w'' is the number of bits in each [[Word (computer architecture)|machine word]]. Whether the least significant bit (of the word) or the most significant bit indicates the smallest-index number is largely irrelevant, but the former tends to be preferred (on [[Endianness|little-endian]] machines).
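For example, with the least significant bit of a byte denoting element 0, the subset {1, 3, 5} of the domain {0, 1, ..., 7} is stored as the single byte 00101010 (decimal 42).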

A finite [[binary relation]] may be represented by a bit array called a [[logical matrix]]. In the [[calculus of relations]], these arrays are composed with [[matrix multiplication]] where the arithmetic is Boolean, and such a composition represents [[composition of relations]].<ref>[[Irving Copilowish]] (December 1948) "Matrix development of the calculus of relations", [[Journal of Symbolic Logic]] 13(4): 193–203 [https://www.jstor.org/stable/2267134?seq=1#page_scan_tab_contents Jstor link]</ref>

== Basic operations ==
Although most machines are not able to address individual bits in memory, nor have instructions to manipulate single bits, each bit in a word can be singled out and manipulated using [[bitwise operation]]s. In particular:

Use <code>OR</code> to set a bit to one:

        11101'''0'''10
     OR 00000'''1'''00
      = 11101'''1'''10

<code>AND</code> to set a bit to zero:

        111010'''1'''0
    AND 111111'''0'''1
      = 111010'''0'''0

<code>AND</code> to determine if a bit is set, by zero-testing:

        1110101'''0'''                 111010'''1'''0
    AND 0000000'''1'''             AND 000000'''1'''0
      = 0000000'''0'''               = 000000'''1'''0
    (=0 ∴ bit isn't set)           (≠0 ∴ bit is set)

<code>XOR</code> to invert or toggle a bit:

        11101'''0'''10                 11101'''1'''10
    XOR 00000'''1'''00             XOR 00000'''1'''00
      = 11101'''1'''10               = 11101'''0'''10

<code>NOT</code> to invert all bits:

    NOT 10110010
      = 01001101

To obtain the [[Mask (computing)|bit mask]] needed for these operations, we can use a [[Bitwise operation#Bit shifts|bit shift]] operator to shift the number 1 to the left by the appropriate number of places, as well as [[bitwise negation]] if necessary.
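
For illustration, this shift-and-mask idiom over an array of 32-bit words might be written in C as follows; the helper names and the least-significant-bit-first layout are merely one possible convention, not part of any standard API:

<syntaxhighlight lang="c">
#include <stddef.h>
#include <stdint.h>

/* Bit i of the array is bit (i % 32) of word a[i / 32] (least significant bit first). */
static inline void set_bit(uint32_t *a, size_t i)    { a[i / 32] |=  (uint32_t)1 << (i % 32); }
static inline void clear_bit(uint32_t *a, size_t i)  { a[i / 32] &= ~((uint32_t)1 << (i % 32)); }
static inline void toggle_bit(uint32_t *a, size_t i) { a[i / 32] ^=  (uint32_t)1 << (i % 32); }
static inline int  test_bit(const uint32_t *a, size_t i) { return (a[i / 32] >> (i % 32)) & 1u; }
</syntaxhighlight>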

Given two bit arrays of the same size representing sets, we can compute their [[union (set theory)|union]], [[intersection (set theory)|intersection]], and [[complement (set theory)|set-theoretic difference]] using ''n''/''w'' simple bit operations each (2''n''/''w'' for difference), as well as the [[Signed number representations#Ones' complement|complement]] of either:

<syntaxhighlight lang="pascal">
for i from 0 to n/w-1
    complement_a[i] := not a[i]
    union[i]        := a[i] or b[i]
    intersection[i] := a[i] and b[i]
    difference[i]   := a[i] and (not b[i])
</syntaxhighlight>

If we wish to iterate through the bits of a bit array, we can do this efficiently using a doubly nested loop that loops through each word, one at a time. Only ''n''/''w'' memory accesses are required:

<syntaxhighlight lang="pascal">
index := 0   // if needed
for i from 0 to n/w-1
    word := a[i]
    for b from 0 to w-1
        value := word and 1 ≠ 0
        word := word shift right 1
        // do something with value
        index := index + 1   // if needed
</syntaxhighlight>

Both of these code samples exhibit ideal [[locality of reference]], and will therefore receive a large performance boost from a data cache. If a cache line is ''k'' words, only about ''n''/(''wk'') cache misses will occur.

== More complex operations ==
As with [[String (computer science)|character strings]], it is straightforward to define ''length'', ''substring'', [[lexicographical order|lexicographical]] ''compare'', ''concatenation'', and ''reverse'' operations. The implementation of some of these operations is sensitive to [[endianness]].

=== Population / Hamming weight ===
If we wish to find the number of 1 bits in a bit array, sometimes called the population count or Hamming weight, there are efficient branch-free algorithms that can compute the number of bits in a word using a series of simple bit operations. We simply run such an algorithm on each word and keep a running total. Counting zeros is similar. See the [[Hamming weight]] article for examples of an efficient implementation.
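
A straightforward C sketch of this word-at-a-time approach is shown below; it counts each word with Kernighan's trick (clearing one set bit per step), whereas a production implementation would typically substitute a branch-free per-word count or a hardware population-count instruction:

<syntaxhighlight lang="c">
#include <stddef.h>
#include <stdint.h>

/* Count the 1 bits in an array of nwords 32-bit words. */
size_t bit_array_popcount(const uint32_t *a, size_t nwords)
{
    size_t total = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint32_t word = a[i];
        while (word) {          /* one iteration per set bit in the word */
            word &= word - 1;   /* clear the lowest set bit */
            total++;
        }
    }
    return total;
}
</syntaxhighlight>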

=== Sorting ===
Similarly, sorting a bit array is trivial to do in O(''n'') time using [[counting sort]] — we count the number of ones ''k'', fill the last ''k''/''w'' words with ones, set only the low ''k'' mod ''w'' bits of the next word, and set the rest to zero.
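
A possible C sketch of this word-wise fill, assuming a least-significant-bit-first layout and, for brevity, that the word size divides ''n'' (the helper and its name are illustrative):

<syntaxhighlight lang="c">
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void set_bit(uint32_t *a, size_t i) { a[i / 32] |= (uint32_t)1 << (i % 32); }

/* Sort an n-bit array in place (all 0s first, then all 1s), given k = number of 1 bits. */
void bit_array_sort(uint32_t *a, size_t n, size_t k)
{
    size_t nwords = n / 32;               /* assumes 32 divides n */
    size_t full   = k / 32;               /* trailing words that become all ones */
    memset(a, 0, nwords * sizeof *a);     /* zeros occupy the first n - k positions */
    memset(a + (nwords - full), 0xFF, full * sizeof *a);
    for (size_t i = n - k; i < n - 32 * full; i++)
        set_bit(a, i);                    /* the remaining k mod 32 one-bits */
}
</syntaxhighlight>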

=== Inversion ===
Vertical flipping of a one-bit-per-pixel image, or some FFT algorithms, requires flipping the bits of individual words (so <code>b31 b30 ... b0</code> becomes <code>b0 ... b30 b31</code>).

When this operation is not available on the processor, it's still possible to proceed by successive passes, in this example on 32 bits:

<syntaxhighlight lang="text">
exchange two 16-bit halfwords
exchange bytes by pairs (0xddccbbaa -> 0xccddaabb)
...
swap bits by pairs
swap bits (b31 b30 ... b1 b0 -> b30 b31 ... b0 b1)
The last operation can be written ((x & 0x55555555) << 1) | ((x & 0xaaaaaaaa) >> 1).
</syntaxhighlight>
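
In C, the successive passes described above can be written as follows for a 32-bit word; this is the standard mask-and-shift bit-reversal technique, and the function name is illustrative:

<syntaxhighlight lang="c">
#include <stdint.h>

uint32_t reverse_bits32(uint32_t x)
{
    x = (x >> 16) | (x << 16);                                /* exchange the two 16-bit halfwords */
    x = ((x & 0xFF00FF00u) >> 8) | ((x & 0x00FF00FFu) << 8);  /* exchange bytes by pairs           */
    x = ((x & 0xF0F0F0F0u) >> 4) | ((x & 0x0F0F0F0Fu) << 4);  /* exchange nibbles within each byte */
    x = ((x & 0xCCCCCCCCu) >> 2) | ((x & 0x33333333u) << 2);  /* swap bits by pairs                */
    x = ((x & 0xAAAAAAAAu) >> 1) | ((x & 0x55555555u) << 1);  /* swap adjacent bits                */
    return x;
}
</syntaxhighlight>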

=== Find first one ===
The [[find first set]] or ''find first one'' operation identifies the index or position of the 1-bit with the smallest index in an array, and has widespread hardware support (for arrays not larger than a word) and efficient algorithms for its computation. When a [[priority queue]] is stored in a bit array, find first one can be used to identify the highest priority element in the queue. To expand a word-size ''find first one'' to longer arrays, one can find the first nonzero word and then run ''find first one'' on that word. The related operations ''find first zero'', ''count leading zeros'', ''count leading ones'', ''count trailing zeros'', ''count trailing ones'', and ''log base 2'' (see [[find first set]]) can also be extended to a bit array in a straightforward manner.
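
For example, a simple C sketch of the array-level operation scans for the first nonzero word and then locates its lowest set bit; a real implementation would normally replace the inner loop with a hardware find-first-set instruction or a compiler builtin:

<syntaxhighlight lang="c">
#include <stddef.h>
#include <stdint.h>

/* Return the index of the lowest 1 bit in an array of nwords 32-bit words, or -1 if all bits are 0. */
long bit_array_find_first_one(const uint32_t *a, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        if (a[i] != 0) {
            uint32_t word = a[i];
            unsigned bit = 0;
            while ((word & 1u) == 0) {   /* position of the lowest set bit within the word */
                word >>= 1;
                bit++;
            }
            return (long)(32 * i + bit);
        }
    }
    return -1;
}
</syntaxhighlight>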

== Compression ==
A bit array is the most dense storage for "random" bits, that is, where each bit is equally likely to be 0 or 1, and each one is independent. But most data are not random, so it may be possible to store it more compactly. For example, the data of a typical fax image is not random and can be compressed. [[Run-length encoding]] is commonly used to compress these long streams. However, most compressed data formats are not so easy to access randomly; also, by compressing bit arrays too aggressively we run the risk of losing the benefits due to bit-level parallelism ([[Array programming|vectorization]]). Thus, instead of compressing bit arrays as streams of bits, we might compress them as streams of bytes or words (see [[Bitmap index#Compression|Bitmap index (compression)]]).

The specific compression technique and implementation details can affect performance. Thus, it might be helpful in practice to [https://github.com/lemire/simplebitmapbenchmark benchmark the various implementations].

Examples:
* [http://code.google.com/p/compressedbitset/ compressedbitset]: WAH Compressed BitSet for Java
* [http://code.google.com/p/javaewah/ javaewah]: a compressed alternative to the Java BitSet class (using Enhanced WAH)
* [http://ricerca.mat.uniroma3.it/users/colanton/concise.html CONCISE]: COmpressed 'N' Composable Integer Set, another bitmap compression scheme for Java
* [http://github.com/lemire/EWAHBoolArray EWAHBoolArray]: a compressed bitmap/bitset class in C++
* [http://code.google.com/p/csharpewah/ CSharpEWAH]: a compressed bitset class in C#
* [http://code.google.com/p/sparsebitmap/ SparseBitmap]: a simple sparse bitmap implementation in Java

== Advantages and disadvantages ==
Bit arrays, despite their simplicity, have a number of marked advantages over other data structures for the same problems:
* They are extremely compact; no other data structure can store ''n'' independent pieces of data in ''n''/''w'' words.
* They allow small arrays of bits to be stored and manipulated in the register set for long periods of time with no memory accesses.
* Because of their ability to exploit bit-level parallelism, limit memory access, and maximally use the [[data cache]], they often outperform many other data structures on practical data sets, even those that are more asymptotically efficient.

However, bit arrays are not the solution to everything. In particular:
* Without compression, they are wasteful set data structures for sparse sets (those with few elements compared to their range) in both time and space. For such applications, compressed bit arrays, [[Judy array]]s, [[trie]]s, or even [[Bloom filter]]s should be considered instead.
* Accessing individual elements can be expensive and difficult to express in some languages. If random access is more common than sequential and the array is relatively small, a byte array may be preferable on a machine with byte addressing. A word array, however, is probably not justified due to the huge space overhead and additional cache misses it causes, unless the machine only has word addressing.

== Applications ==
Because of their compactness, bit arrays have a number of applications in areas where space or efficiency is at a premium. Most commonly, they are used to represent a simple group of Boolean flags or an ordered sequence of Boolean values.

Bit arrays are used for [[priority queue]]s, where the bit at index ''k'' is set if and only if ''k'' is in the queue; this data structure is used, for example, by the [[Linux kernel]], and benefits strongly from a find-first-zero operation in hardware.

Bit arrays can be used for the allocation of memory pages, [[inode]]s, disk sectors, etc. In such cases, the term ''bitmap'' may be used. However, this term is frequently used to refer to [[raster graphics|raster images]], which may use multiple bits per [[pixel]].

Another application of bit arrays is the [[Bloom filter]], a probabilistic [[set data structure]] that can store large sets in a small space in exchange for a small probability of error. It is also possible to build probabilistic [[hash table]]s based on bit arrays that accept either false positives or false negatives.

Bit arrays and the operations on them are also important for constructing [[succinct data structure]]s, which use close to the minimum possible space. In this context, operations like finding the ''n''th 1 bit or counting the number of 1 bits up to a certain position become important.

Bit arrays are also a useful abstraction for examining streams of [[data compression|compressed]] data, which often contain elements that occupy portions of bytes or are not byte-aligned. For example, the compressed [[Huffman coding]] representation of a single 8-bit character can be anywhere from 1 to 255 bits long.

In [[information retrieval]], bit arrays are a good representation for the [[posting list]]s of very frequent terms. If we compute the gaps between adjacent values in a list of strictly increasing integers and encode them using [[unary coding]], the result is a bit array with a 1 bit in the ''n''th position if and only if ''n'' is in the list. The implied probability of a gap of ''n'' is 1/2<sup>''n''</sup>. This is also the special case of [[Golomb coding]] where the parameter M is 1; this parameter is only normally selected when {{math|−log(2 − ''p'') / log(1 − ''p'') ≤ 1}}, or roughly when the term occurs in at least 38% of documents.
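For example, the list {1, 3, 4, 7} gives the gaps 1, 2, 1, 3 (counting the first value as its gap from zero); coding each gap ''g'' in unary as ''g'' − 1 zeros followed by a single 1 bit yields the codes 1, 01, 1, 001, whose concatenation 1011001 has 1 bits exactly at positions 1, 3, 4 and 7.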

== Language support ==
The [[APL (programming language)|APL programming language]] fully supports bit arrays of arbitrary shape and size as a Boolean datatype distinct from integers. All major implementations ([[APL (programming language)#Execution|Dyalog APL, APL2, APL Next, NARS2000, Gnu APL]], etc.) pack the bits densely into whatever size the machine word is. Bits may be accessed individually via the usual indexing notation (A[3]) as well as through all of the usual primitive functions and operators, where they are often operated on using a special-case algorithm such as summing the bits via a table lookup of bytes.

The [[C (programming language)|C programming language]]'s ''[[bit field]]s'', pseudo-objects found in structs with size equal to some number of bits, are in fact small bit arrays; they are limited in that they cannot span words. Although they give a convenient syntax, the bits are still accessed using bytewise operators on most machines, and they can only be defined statically (like C's static arrays, their sizes are fixed at compile-time). It is also a common idiom for C programmers to use words as small bit arrays and access bits of them using bit operators. A widely available header file included in the [[X11]] system, xtrapbits.h, is “a portable way for systems to define bit field manipulation of arrays of bits.” A more explanatory description of the aforementioned approach can be found in the [http://c-faq.com/misc/bitsets.html comp.lang.c faq].
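
For illustration, both idioms might look like this in C (the field and constant names are purely illustrative):

<syntaxhighlight lang="c">
#include <stdint.h>

/* A struct bit field: the compiler packs the members into a small number of bytes. */
struct status {
    unsigned ready   : 1;
    unsigned error   : 1;
    unsigned channel : 4;    /* a 4-bit field holding values 0..15 */
};

/* A single word used as a small bit array, manipulated with bitwise operators. */
enum { OPT_VERBOSE = 1u << 0, OPT_FORCE = 1u << 1, OPT_DRY_RUN = 1u << 2 };

static inline int has_option(uint32_t opts, uint32_t opt) { return (opts & opt) != 0; }
</syntaxhighlight>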

In [[C++]], although individual <code>bool</code>s typically occupy the same space as a byte or an integer, the [[Standard Template Library|STL]] type <code>vector<bool></code> is a [[partial template specialization]] in which bits are packed as a space efficiency optimization. Since bytes (and not bits) are the smallest addressable unit in C++, the [] operator does ''not'' return a reference to an element, but instead returns a [[Proxy pattern|proxy reference]]. This might seem a minor point, but it means that <code>vector<bool></code> is ''not'' a standard STL container, which is why the use of <code>vector<bool></code> is generally discouraged. Another unique STL class, <code>bitset</code>,<ref name="c++" /> creates a vector of bits fixed at a particular size at compile-time, and in its interface and syntax more resembles the idiomatic use of words as bit sets by C programmers. It also has some additional power, such as the ability to efficiently count the number of bits that are set. The [[Boost C++ Libraries]] provide a <code>dynamic_bitset</code> class<ref name="boost" /> whose size is specified at run-time.

The [[D programming language]] provides bit arrays in its standard library, Phobos, in <code>std.bitmanip</code>. As in C++, the [] operator does not return a reference, since individual bits are not directly addressable on most hardware, but instead returns a <code>bool</code>.

In [[Java (programming language)|Java]], the class {{Javadoc:SE|java/util|BitSet}} creates a bit array that is then manipulated with functions named after bitwise operators familiar to C programmers. Unlike the <code>bitset</code> in C++, the Java <code>BitSet</code> does not have a "size" state (it has an effectively infinite size, initialized with 0 bits); a bit can be set or tested at any index. In addition, there is a class {{Javadoc:SE|java/util|EnumSet}}, which represents a Set of values of an [[enumerated type]] internally as a bit vector, as a safer alternative to bit fields.

The [[.NET Framework]] supplies a <code>BitArray</code> collection class. It stores bits using an array of type <code>int</code> (each element in the array usually represents 32 bits).<ref>{{cite web|url=https://github.com/microsoft/referencesource/blob/master/mscorlib/system/collections/bitarray.cs|title=.NET mscorlib source code|website=github.com/microsoft|date=15 October 2021}}</ref> The class supports random access and bitwise operators, can be iterated over, and its <code>Length</code> property can be changed to grow or truncate it.

Although [[Standard ML]] has no support for bit arrays, Standard ML of New Jersey has an extension, the <code>BitArray</code> structure, in its SML/NJ Library. It is not fixed in size and supports set operations and bit operations, including, unusually, shift operations.

[[Haskell (programming language)|Haskell]] likewise currently lacks standard support for bitwise operations, but both [[Glasgow Haskell Compiler|GHC]] and Hugs provide a <code>Data.Bits</code> module with assorted bitwise functions and operators, including shift and rotate operations. An "unboxed" array over Boolean values may be used to model a bit array, although this lacks support from the former module.

In [[Perl]], strings can be used as expandable bit arrays. They can be manipulated using the usual bitwise operators (<code>~ | & ^</code>),<ref>{{cite web|url=http://perldoc.perl.org/perlop.html#Bitwise-String-Operators|title=perlop - perldoc.perl.org|website=perldoc.perl.org}}</ref> and individual bits can be tested and set using the ''vec'' function.<ref>{{cite web|url=http://perldoc.perl.org/functions/vec.html|title=vec - perldoc.perl.org|website=perldoc.perl.org}}</ref>

In [[Ruby (programming language)|Ruby]], you can access (but not set) a bit of an integer (<code>Fixnum</code> or <code>Bignum</code>) using the bracket operator (<code>[]</code>), as if it were an array of bits.

Apple's [[Core Foundation]] library contains [https://developer.apple.com/library/mac/#documentation/CoreFoundation/Reference/CFBitVectorRef/Reference/reference.html CFBitVector] and [https://developer.apple.com/library/mac/#documentation/CoreFoundation/Reference/CFMutableBitVectorRef/Reference/reference.html#//apple_ref/doc/uid/20001500 CFMutableBitVector] structures.

[[PL/I]] supports arrays of ''bit strings'' of arbitrary length, which may be either fixed-length or varying. The array elements may be ''aligned'' (each element begins on a byte or word boundary) or ''unaligned'' (elements immediately follow each other with no padding).

[[PL/pgSQL]] and PostgreSQL's SQL support ''bit strings'' as a native type. There are two SQL bit types: <code>bit(n)</code> and <code>bit varying(n)</code>, where <code>n</code> is a positive integer.<ref>{{Cite web|url=https://www.postgresql.org/docs/current/datatype-bit.html|title=8.10. Bit String Types|date=30 September 2021}}</ref>

Hardware description languages such as [[VHDL]], [[Verilog]], and [[SystemVerilog]] natively support bit vectors, as these are used to model storage elements like [[Flip-flop (electronics)|flip-flops]], hardware buses, and hardware signals in general. In hardware verification languages such as OpenVera, [[e (verification language)|''e'']] and [[SystemVerilog]], bit vectors are used to sample values from the hardware models, and to represent data that is transferred to hardware during simulations.

[[Common Lisp]] provides a one-dimensional <code>bit-vector</code> implementation as a special case of the built-in <code>array</code>, acting in a dual capacity as a class and a type specifier.<ref>{{cite web|url=http://www.lispworks.com/documentation/HyperSpec/Body/t_bt_vec.htm|title=CLHS: System Class BIT-VECTOR|website=www.lispworks.com}}</ref> Being a derivative of the array, it relies on the general <code>make-array</code> function to be configured with an element type of <code>bit</code>, which optionally permits the bit vector to be designated as dynamically resizable. The <code>bit-vector</code>, however, is not infinite in extent. A more restricted <code>simple-bit-vector</code> type exists, which explicitly excludes the dynamic characteristics.<ref>{{cite web|url=http://www.lispworks.com/documentation/HyperSpec/Body/t_smp_bt.htm|title=CLHS: Type SIMPLE-BIT-VECTOR|website=www.lispworks.com}}</ref> Bit vectors are represented as, and can be constructed in a more concise fashion by, the ''reader macro'' <code>#*<i>bits</i></code>.<ref>{{cite web|url=http://www.lispworks.com/documentation/HyperSpec/Body/02_dhd.htm|title=CLHS: Section 2.4.8.4|website=www.lispworks.com}}</ref> In addition to the general functions applicable to all arrays, dedicated operations exist for bit vectors. Single bits may be accessed and modified using the <code>bit</code> and <code>sbit</code> functions<ref>{{cite web|url=http://www.lispworks.com/documentation/HyperSpec/Body/f_bt_sb.htm|title=CLHS: Accessor BIT, SBIT|website=www.lispworks.com}}</ref> and an extensive number of logical operations is supported.<ref>{{cite web|url=http://www.lispworks.com/documentation/HyperSpec/Body/f_bt_and.htm|title=CLHS: Function BIT-AND, BIT-ANDC1, BIT-ANDC2...|website=www.lispworks.com}}</ref>

==See also==
* [[Arithmetic logic unit]]
* [[Binary code]]
* [[Binary numeral system]]
* [[Bitboard]], used in chess and similar games
* [[Bit field]]
* [[Bitmap index]]
* [[Bitstream]]
* [[Finite field]] of 2 elements, or [[GF(2)]]
* [[Judy array]]
* [[Variable-length code]]

==References==
{{Reflist|refs=
<ref name="linux">{{cite web|url=https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html|title=Linux Magic System Request Key Hacks|publisher=[[Kernel.org]]}}</ref>
<ref name="c++">{{cite web|url=http://www.sgi.com/tech/stl/bitset.html|title=SGI.com Tech Archive Resources now retired|date=2 January 2018|publisher=[[Silicon Graphics|SGI]]}}</ref>
<ref name="boost">{{cite web|url=http://www.boost.org/libs/dynamic_bitset/dynamic_bitset.html|title=dynamic_bitset<Block, Allocator> - 1.66.0|website=www.boost.org}}</ref>
}}

==External links==
* [http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz mathematical bases] {{Webarchive|url=https://web.archive.org/web/20191016044101/http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz |date=2019-10-16 }} by D. E. Knuth
* [http://www.gotw.ca/publications/N1185.pdf vector<bool> Is Nonconforming, and Forces Optimization Choice]
* [http://www.gotw.ca/publications/N1211.pdf vector<bool>: More Problems, Better Solutions]
* [http://pypi.python.org/pypi/bitarray bitarray module] for Python
* [https://github.com/noporpoise/BitArray BitArray library] for C

{{Data structures}}