Byte pair encoding: Difference between revisions
m added links |
Mboverload (talk | contribs) RegExTypoFix Typos: occurance → occurrence, using AWB |
||
Line 14: | Line 14: | ||
ZabZabac |
ZabZabac |
||
In this data the byte pair "Za" occurs most often, so it will be replaced with a byte that is not used in the data, "Y". (In this case "Za" could be replaced by "Z", since every |
In this data the byte pair "Za" occurs most often, so it will be replaced with a byte that is not used in the data, "Y". (In this case "Za" could be replaced by "Z", since every occurrence of "Z" will be replaced.) The replacement table and data are now |
||
Z <- aa |
Z <- aa |
||
Line 32: | Line 32: | ||
To decompressed the data, simply perform the replacements in reverse order. |
To decompressed the data, simply perform the replacements in reverse order. |
||
[[Category:Lossless compression algorithms]] |
[[Category:Lossless compression algorithms]] |
Revision as of 00:45, 15 August 2006
Byte pair encoding is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. A table of the replacements is required to rebuild the original data.
Byte pair encoding example
Suppose we wanted to encode the data
aaabaaabac
The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, "Z". Now we have a replacement table of
Z <- aa
and the data is
ZabZabac
In this data the byte pair "Za" occurs most often, so it will be replaced with a byte that is not used in the data, "Y". (In this case "Za" could be replaced by "Z", since every occurrence of "Z" will be replaced.) The replacement table and data are now
Z <- aa Y <- Za
YbYbac
Once again we replace the byte pair that occurs most often.
Z <- aa Y <- Za X <- Yb
XXac
This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.
To decompressed the data, simply perform the replacements in reverse order.