Jump to content

Byte pair encoding

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by MindlessXD (talk | contribs) at 13:53, 4 July 2006. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

(diff) ← Previous revision | Latest revision (diff) | Newer revision → (diff)

Byte pair encoding is a simple form of data compression in which the two most common bytes of data are replaced with a byte that does not occur within the data. A table of the replacements is required to rebuild the original data.

Byte pair encoding example

Suppose we wanted to encode the data

aaabaaabac

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, "Z". Now we have a replacement table of

Z <- aa

and the data is

ZabZabac

In this data the byte pair "Za" occurs most often, so it will be replaced with a byte that is not used in the data, "Y". (In this case "Za" could be replaced by "Z", since every occurance of "Z" will be replaced.) The replacement table and data are now

Z <- aa
Y <- Za
YbYbac

Once again we replace the byte pair that occurs most often.

Z <- aa
Y <- Za
X <- Yb
XXac

This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.

To decompressed the data, simply perform the replacements in reverse order.