Ascii85: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
No edit summary
Tags: references removed Mobile edit Mobile web edit
RFC 1924 version: Make it more clear that this section is describing a joke and not a serious encoding
(42 intermediate revisions by 36 users not shown)
Line 1: Line 1:
more footnotes date=March 2013 Ascii85, also called Base85, is a form of binary-to-text encoding developed by Paul E. Rutter for the btoa utility. By using five ASCII characters to represent four bytes of binary data, making the encoded size 1/4 larger than the original (assuming eight bits per ASCII character), it is more efficient than uuencode or Base64, which use four characters to represent three bytes of data, a 1/3 increase, assuming eight bits
{{short description|Form of binary-to-text encoding developed by Paul E. Rutter}}
'''Ascii85''', also called '''Base85''', is a form of [[binary-to-text encoding]] developed by Paul E. Rutter for the [[btoa]] utility. By using five [[ASCII]] characters to represent four bytes of [[binary data]] (making the encoded size {{1/4}} larger than the original, assuming eight bits per ASCII character), it is more efficient than [[uuencode]] or [[Base64]], which use four characters to represent three bytes of data ({{1/3}} increase, assuming eight bits per ASCII character).
Its main modern uses are in Adobe Systems's PostScript and Portable Document Format file formats, as well as in the patch encoding for binary files used by Git. (Junio Hamano, "binary patch", May 5, 2006, http://www.gelato.unsw.edu.au/archives/git/0605/19975.html)


Its main modern uses are in [[Adobe Systems|Adobe]]'s [[PostScript]] and [[Portable Document Format]] file formats, as well as in the [[patch (Unix)|patch]] encoding for [[binary file]]s used by [[Git]].<ref>{{cite web |author=Hamano |first=Junio C |author-link=Junio Hamano |date=May 5, 2006 |title=[PATCH] binary patch. |url=http://www.gelato.unsw.edu.au/archives/git/0605/19975.html |url-status=dead |archive-url=https://web.archive.org/web/20200726102316/http://www.gelato.unsw.edu.au/archives/git/0605/19975.html |archive-date=2020-07-26 |website=git}}</ref>
==Basic idea==
The basic need for a binary-to-text encoding comes from a need to communicate arbitrary [[binary data]] over preexisting [[communications protocol]]s that were designed to carry only English language [[human-readable]] text. Those communication protocols may only be 7-bit safe (and within that avoid certain ASCII control codes), and may require [[line break (computing)|line break]]s at certain maximum intervals, and may not maintain [[whitespace (computer science)|whitespace]]. Thus, only the 95 [[ASCII#ASCII printable characters|printable ASCII characters]] are "safe" to use to convey data.


==Overview==
Four bytes can represent 2<sup>32</sup>&nbsp;= 4,294,967,296 possible values. Five [[radix]]-85 digits provide 85<sup>5</sup>&nbsp;= 4,437,053,125 possible values, enough to provide for a unique representation for each possible 32-bit value. Because five radix-84 digits only provide 84<sup>5</sup>&nbsp;=&nbsp;4,182,119,424 representable values, 85 is the minimum possible integral base that will represent four bytes in five characters, hence its choice.
The basic need for a binary-to-text encoding comes from a need to communicate arbitrary [[binary data]] over preexisting [[communications protocol]]s that were designed to carry only English language [[human-readable]] text. Those communication protocols may only be 7-bit safe (and within that avoid certain ASCII control codes), and may require [[line break (computing)|line break]]s at certain maximum intervals, and may not maintain [[whitespace (computer science)|whitespace]]. Thus, only the 94 [[ASCII#ASCII printable characters|printable ASCII characters]] are "safe" to use to convey data.


Eighty-five is the minimum integer value of ''n'' such that {{nobr|''n''<sup>5</sup> ≥ 256<sup>4</sup>}}; so ''any'' sequence of 4 bytes can be encoded as 5 symbols, as long as at least 85 distinct symbols are available. (Five [[radix]]-85 digits can represent the integers from 0 to 4,437,053,124 inclusive, which suffice to represent all 4,294,967,296 possible 4-byte sequences.)
When encoding, each group of 4 bytes is taken as a 32-bit binary number, most significant byte first (Ascii85 uses a [[big-endian]] convention). This is converted, by repeatedly dividing by 85 and taking the remainder, into 5 radix-85 digits. Then each digit (again, most significant first) is encoded as an ASCII printable character by adding 33 to it, giving the ASCII characters 33 ("<code>!</code>") through 117 ("<code>u</code>").


== Encoding ==
Because all-zero data is quite common, an exception is made for the sake of [[data compression]], and an all-zero group is encoded as a single character "<code>z</code>" instead of "<code>!!!!!</code>".
{{Unreferenced section|date=March 2023}}When encoding, each group of 4 bytes is taken as a 32-bit binary number, most significant byte first (Ascii85 uses a [[big-endian]] convention). This is converted, by repeatedly dividing by 85 and taking the remainder, into 5 radix-85 digits. Then each digit (again, most significant first) is encoded as an ASCII printable character by adding 33 to it, giving the ASCII characters 33 (<code>!</code>) through 117 (<code>u</code>).


Because all-zero data is quite common, an exception is made for the sake of [[data compression]], and an all-zero group is encoded as a single character <code>z</code> instead of <code>!!!!!</code>.
Groups of characters that decode to a value greater than {{nobr|2<sup>32</sup> − 1}} (encoded as "<code>s8W-!</code>") will cause a decoding error, as will "<code>z</code>" characters in the middle of a group. White space between the characters is ignored and may occur anywhere to accommodate line-length limitations.


Groups of characters that decode to a value greater than {{nobr|2<sup>32</sup> − 1}} (encoded as <code>s8W-!</code>) will cause a decoding error, as will <code>z</code> characters in the middle of a group. White space between the characters is ignored and may occur anywhere to accommodate line-length limitations.
One disadvantage of Ascii85 is that encoded data may contain [[escape character]]s such as backslash and quote, which have special meaning in many programming languages and in some text-based protocols. Other base-85 encodings like Z85 are designed to be safe in source code.<ref>[http://rfc.zeromq.org/spec:32 "Z85 - ZeroMQ Base-85 Encoding Algorithm"]</ref>

==Limitations==
The original specification only allows a stream that is a multiple of 4 bytes to be encoded.

Encoded data may contain [[escape character|character]]s that have special meaning in many programming languages and in some text-based protocols, such as left-angle-bracket <code>&lt;</code>, backslash <code>\</code>, and the single and double quotes <code>'</code> & <code>"</code>. Other base-85 encodings like Z85 and {{IETF RFC|1924}} are designed to be safe in source code.<ref>[http://rfc.zeromq.org/spec:32 "32/Z85"] on ZeroMQ RFC</ref>


==History==
==History==


===btoa version===
===btoa version===
{{redirect|btoa|the JavaScript <code>btoa()</code> function|Base64}}
The original btoa program always encoded full groups (padding the source as necessary), with a prefix line of "xbtoa Begin", and suffix line of "xbtoa End", followed by the original file length (in decimal and [[hexadecimal]]) and three 32-bit [[checksum]]s. The decoder needs to use the file length to see how much of the group was padding. The initial proposal for btoa encoding used an encoding alphabet starting at the ASCII space character through "t" inclusive, but this was replaced with an encoding alphabet of "!" to "u" to avoid "problems with some mailers (stripping off trailing blanks)."<ref>{{cite web|last1=Orost|first1=Joe|title=Re: COMPRESSING of binary data into mailable ASCII Re: Encoding of binary data into mailable ASCII|url=https://groups.google.com/forum/#!original/comp.compression/Ve7k8XF-F5k/gBWfpyL-gfgJ|website=Google Groups|accessdate=11 April 2015}}</ref> This program also introduced the special "<code>z</code>" short form for an all-zero group. Version 4.2 added a "<code>y</code>" exception for a group of all ASCII [[space (punctuation)|space]] characters (0x20202020).
The original btoa program always encoded full groups (padding the source as necessary), with a prefix line of "xbtoa Begin", and suffix line of "xbtoa End", followed by the original file length (in decimal and [[hexadecimal]]) and three 32-bit [[checksum]]s. The decoder needs to use the file length to see how much of the group was padding. The initial proposal for btoa encoding used an encoding alphabet starting at the ASCII space character through "t" inclusive, but this was replaced with an encoding alphabet of "!" to "u" to avoid "problems with some mailers (stripping off trailing blanks)".<ref>{{cite web |last1=Orost |first1=Joe |date=Mar 26, 1991 |title=Re: COMPRESSING of binary data into mailable ASCII Re: Encoding of binary data into mailable ASCII |url=https://groups.google.com/forum/#!original/comp.compression/Ve7k8XF-F5k/gBWfpyL-gfgJ |access-date=11 April 2015 |website=Google Groups}}</ref> This program also introduced the special "<code>z</code>" short form for an all-zero group. Version 4.2 added a "<code>y</code>" exception for a group of all ASCII [[space (punctuation)|space]] characters (0x20202020).


===ZMODEM version===
===ZMODEM version===
"ZMODEM Pack-7 encoding" encodes groups of 4 octets into groups of 5 printable ASCII characters, similar to Ascii85 (or perhaps exactly the same?). When [[ZMODEM]] programs send pre-compressed 8-bit data files over [[8-bit clean|7-bit data channels]], it uses "ZMODEM Pack-7 encoding".<ref>Chuck Forsberg. {{webarchive |url=https://web.archive.org/web/20150924060127/http://www.omen.com/zmdmwn.html |title="Recent Developments in ZMODEM"}}. "ZMODEM Pack-7 packs 4 bytes into 5 printing characters."</ref>
"ZMODEM Pack-7 encoding" encodes groups of 4 octets into groups of 5 printable ASCII characters in a similar, or possibly in the same way as Ascii85 does. When a [[ZMODEM]] program sends pre-compressed 8-bit data files over [[8-bit clean|7-bit data channels]], it uses "ZMODEM Pack-7 encoding".<ref>Chuck Forsberg. {{Cite web |title=Recent Developments in ZMODEM |url=http://www.omen.com/zmdmwn.html |url-status=dead |archive-url=https://web.archive.org/web/20150924060127/http://www.omen.com/zmdmwn.html |archive-date=2015-09-24 |access-date=2013-05-14 |website=omen.com}}. "ZMODEM Pack-7 packs 4 bytes into 5 printing characters."</ref>


===Adobe version===
===Adobe version===
Adobe adopted the basic btoa encoding, but with slight changes, and gave it the name Ascii85. The characters used are the ASCII characters 33 (!) through 117 (u) inclusive (to represent the base-85 digits 0 through 84), together with the letter z (as a special case to represent a 32-bit 0 value), and white space is ignored. Adobe uses the delimiter "<code>~></code>" to mark the end of an Ascii85-encoded string, and represents the length by truncating the final group: If the last block of source bytes contains fewer than 4 bytes, the block is padded with up to three null bytes before encoding. After encoding, as many bytes as were added as padding are removed from the end of the output.
Adobe adopted the basic btoa encoding, but with slight changes, and gave it the name Ascii85. The characters used are the ASCII characters 33 (<code>!</code>) through 117 (<code>u</code>) inclusive (to represent the base-85 digits 0 through 84), together with the letter <code>z</code> (as a special case to represent a 32-bit 0&nbsp;value), and white space is ignored. Adobe uses the delimiter "<code>~></code>" to mark the end of an Ascii85-encoded string and represents the length by truncating the final group: If the last block of source bytes contains fewer than 4 bytes, the block is padded with up to 3 null bytes before encoding. After encoding, as many bytes as were added as padding are removed from the end of the output.


The reverse is applied when decoding: The last block is padded to 5 bytes with the Ascii85 character "<code>u</code>", and as many bytes as were added as padding are omitted from the end of the output (see example).
The reverse is applied when decoding: The last block is padded to 5 bytes with the Ascii85 character <code>u</code>, and as many bytes as were added as padding are omitted from the end of the output (see example).


NOTE: The padding is not arbitrary. Converting from binary to base 64 only regroups bits and does not change them or their order (a high bit in binary does not affect the low bits in the base64 representation). In converting a binary number to base85 (85 is ''not'' a power of two) high bits do affect the low order base85 digits and conversely. Padding the binary low (with zero bits) while encoding and padding the base85 value high (with 'u's) in decoding assures that the high order bits are preserved (the zero padding in the binary gives enough room so that a small addition is trapped and there is no "carry" to the high bits).
The padding is not arbitrary. Converting from binary to base 64 only regroups bits and does not change them or their order (a high bit in binary does not affect the low bits in the base64 representation). In converting a binary number to base85 (85 is ''not'' a power of two) high bits do affect the low order base85 digits and conversely. Padding the binary low (with zero bits) while encoding and padding the base85 value high (with <code>u</code>s) in decoding assures that the high order bits are preserved (the zero padding in the binary gives enough room so that a small addition is trapped and there is no "carry" to the high bits).


<!-- TODO: Wikify and summarize the following paragraphs. This is a nice explanation, but the style does not fit Wikipedia, and it is way too long compared to the rest of the article. Ideally we could link to an explanation like this outside of Wikipedia. -->
<!-- TODO: Wikify and summarize the following paragraphs. This is a nice explanation, but the style does not fit Wikipedia, and it is way too long compared to the rest of the article. Ideally we could link to an explanation like this outside of Wikipedia. -->
Line 202: Line 209:


So ... the padding for decoding has to be matched with the padding for encoding in base85 since high bits affect low base85 digits and conversely - which does not occur with base64 encoding.
So ... the padding for decoding has to be matched with the padding for encoding in base85 since high bits affect low base85 digits and conversely - which does not occur with base64 encoding.

-->
-->

In Ascii85-encoded blocks, whitespace and line-break characters may be present anywhere, including in the middle of a 5-character block, but they must be silently ignored.
In Ascii85-encoded blocks, whitespace and line-break characters may be present anywhere, including in the middle of a 5-character block, but they must be silently ignored.


Adobe's specification does not support the "<code>y</code>" exception.
Adobe's specification does not support the <code>y</code> exception.

===ZeroMQ Version (Z85)===
Z85, the [[ZeroMQ]] base-85 encoding algorithm, is a string-safe variant of base85. By avoiding the double-quote, single-quote, and backslash characters, Z85-encoded data can be better embedded in [[command-line interpreter]] strings. Z85 uses the characters <tt>0</tt>...<tt>9</tt>, <tt>a</tt>...<tt>z</tt>, <tt>A</tt>...<tt>Z</tt>, <tt>.</tt>, <tt>-</tt>, <tt>:</tt>, <tt>+</tt>, <tt>=</tt>, <tt>^</tt>, <tt>!</tt>, <tt>/</tt>, <tt>*</tt>, <tt>?</tt>, <tt>&</tt>, <tt>&lt;</tt>, <tt>&gt;</tt>, <tt>(</tt>, <tt>)</tt>, <tt>&#91;</tt>, <tt>&#93;</tt>, <tt>&#123;</tt>, <tt>&#125;</tt>, <tt>@</tt>, <tt>%</tt>, <tt>$</tt>, <tt>#</tt>.<ref>Pieter Hintjens [http://rfc.zeromq.org/spec:32 RFC 32/Z85 - ZeroMQ Base-85 Encoding Algorithm]</ref>


===Example for Ascii85===
===Example for Ascii85===
Line 221: Line 225:


<pre>
<pre>
<~9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,
9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,O<
O<DJ+*.@<*K0@<6L(Df-\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKY
DJ+*.@<*K0@<6L(Df-\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKYi(
i(DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIa
DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIal(
l(DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs`8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G
DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs`8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G>u
>uD.RTpAKYo'+CT/5+Cei#DII?(E,9)oF*2M7/c~>
D.RTpAKYo'+CT/5+Cei#DII?(E,9)oF*2M7/c
</pre>
</pre>


Line 235: Line 239:
| colspan="8" align="center"| ''' '''
| colspan="8" align="center"| ''' '''
| align="center"| ...
| align="center"| ...
| colspan="8" align="center"| '''s'''
| colspan="8" align="center"| '''u'''
| colspan="8" align="center"| '''r'''
| colspan="8" align="center"| '''e'''
|-
|-
! ASCII
! ASCII
Line 246: Line 246:
| colspan="8" align="center"| 32
| colspan="8" align="center"| 32
| align="center"| ...
| align="center"| ...
| colspan="8" align="center"| 115
| colspan="8" align="center"| 117
| colspan="8" align="center"| 114
| colspan="8" align="center"| 101
|-
|-
! Bit pattern
! Bit pattern
|0||1||0||0||1||1||0||1||0||1||1||0||0||0||0||1||0||1||1||0||1||1||1||0||0||0||1||0||0||0||0||0|0
|0||1||0||0||1||1||0||1||0||1||1||0||0||0||0||1||0||1||1||0||1||1||1||0||0||0||1||0||0||0||0||0|0
| align="center"| ...
| align="center"| ...
||0||1||1||1||0||0||1||1||0||1||1||1||0||1||0||1||0||1||1||1||0||0||1||0||0||1||1||0||0||1||0||1
|-
|-
! 32-bit Value
! 32-bit Value
| colspan="32" align="center"| 1,298,230,816 = 24×85<sup>4</sup> + 73×85<sup>3</sup> + 80×85<sup>2</sup> + 78×85 + 61
| colspan="32" align="center"| 1,298,230,816 = 24×85<sup>4</sup> + 73×85<sup>3</sup> + 80×85<sup>2</sup> + 78×85 + 61
| align="center"| ...
| align="center"| ...
| colspan="32" align="center"| 1,937,076,837 = 37×85<sup>4</sup> + 9×85<sup>3</sup> + 17×85<sup>2</sup> + 44×85 + 22
|-
|-
! Base 85 (+33)
! Base 85 (+33)
Line 268: Line 262:
| colspan="6" align="center"| 61 (94)
| colspan="6" align="center"| 61 (94)
| align="center"| ...
| align="center"| ...
| colspan="6" align="center"| 37 (70)
| colspan="7" align="center"| 9 (42)
| colspan="6" align="center"| 17 (50)
| colspan="7" align="center"| 44 (77)
| colspan="6" align="center"| 22 (55)
|-
|-
! ASCII
! ASCII
Line 281: Line 270:
| colspan="6" align="center"| ^
| colspan="6" align="center"| ^
| align="center"| ...
| align="center"| ...
|}

{| class="wikitable"
! Text content
| colspan="8" align="center"| '''s'''
| colspan="8" align="center"| '''u'''
| colspan="8" align="center"| '''r'''
| colspan="8" align="center"| '''e'''
|-
! ASCII
| colspan="8" align="center"| 115
| colspan="8" align="center"| 117
| colspan="8" align="center"| 114
| colspan="8" align="center"| 101
|-
! Bit pattern
||0||1||1||1||0||0||1||1||0||1||1||1||0||1||0||1||0||1||1||1||0||0||1||0||0||1||1||0||0||1||0||1
|-
! 32-bit Value
| colspan="32" align="center"| 1,937,076,837 = 37×85<sup>4</sup> + 9×85<sup>3</sup> + 17×85<sup>2</sup> + 44×85 + 22
|-
! Base 85 (+33)
| colspan="6" align="center"| 37 (70)
| colspan="7" align="center"| 9 (42)
| colspan="6" align="center"| 17 (50)
| colspan="7" align="center"| 44 (77)
| colspan="6" align="center"| 22 (55)
|-
! ASCII
| colspan="6" align="center"| F
| colspan="6" align="center"| F
| colspan="7" align="center"| *
| colspan="7" align="center"| *
Line 287: Line 305:
| colspan="6" align="center"| 7
| colspan="6" align="center"| 7
|}
|}



Since the last 4-tuple is incomplete, it must be padded with three zero bytes:
Since the last 4-tuple is incomplete, it must be padded with three zero bytes:
Line 318: Line 337:
| colspan="6" align="center"| /
| colspan="6" align="center"| /
| colspan="7" align="center"| c
| colspan="7" align="center"| c
| colspan="6" align="center"| ''Y''
| colspan="6" align="center"| <s>Y</s>
| colspan="7" align="center"| ''k''
| colspan="7" align="center"| <s>k</s>
| colspan="6" align="center"| ''O''
| colspan="6" align="center"| <s>O</s>
|}
|}


Line 356: Line 375:
| colspan="8" align="center"| '''.'''
| colspan="8" align="center"| '''.'''
| colspan="8" align="center"| ''[ [[End-of-text character|ETX]] ]''
| colspan="8" align="center"| ''[ [[End-of-text character|ETX]] ]''
| colspan="8" align="center"| ''[ EM ]''
| colspan="8" align="center"| ''[ [[End of Medium|EM]] ]''
| colspan="8" align="center"| ''&#180; ([[Extended ASCII]])''
| colspan="8" align="center"| ''´ ([[Extended ASCII]])''
|}
|}


Line 367: Line 386:
The Ascii85 encoding is compatible with 7-bit and 8-bit [[MIME]], while having less overhead than [[Base64]].
The Ascii85 encoding is compatible with 7-bit and 8-bit [[MIME]], while having less overhead than [[Base64]].


One potential compatibility issue of Ascii85 is that 'single' and "double" quotation marks, <angle> brackets, and ampersands (&) cannot be used unescaped in markup languages like XML or SGML.
One potential compatibility issue of Ascii85 is that some of the characters it uses are significant in markup languages such as [[XML]] or [[SGML]]. To include ascii85 data in these documents, it may be necessary to escape the [[quotation mark|quote]], [[bracket#Angle brackets in programming languages|angle brackets]], and [[ampersand]]s.


==<nowiki>RFC 1924</nowiki> version==
==<nowiki>RFC 1924</nowiki> version==
Published on [[April Fools' Day Request for Comments|April 1, 1996]], informational {{IETF RFC|1924}}: "A Compact Representation of IPv6 Addresses" by [[Kevin Robert Elz|Robert Elz]] suggests a base-85 encoding of [[IPv6]] addresses. This differs from the scheme used above in that he proposes a different set of 85 ASCII characters, and proposes to do all arithmetic on the 128-bit number, converting it to a single 20-digit base-85 number (internal whitespace not allowed), rather than breaking it into four 32-bit groups.
Published on [[April Fools' Day Request for Comments|April 1, 1996]], informational {{IETF RFC|1924}}: "A Compact Representation of IPv6 Addresses" by [[Kevin Robert Elz|Robert Elz]] suggests a base-85 encoding of [[IPv6]] addresses as an [[April Fools' Day]] joke. This differs from the scheme used above in that he proposes a different set of 85 ASCII characters, and proposes to do all arithmetic on the 128-bit number, converting it to a single 20-digit base-85 number (internal whitespace not allowed), rather than breaking it into four 32-bit groups.


The proposed character set is, in order, <code>0</code>–<code>9</code>, <code>A</code>–<code>Z</code>, <code>a</code>–<code>z</code>, and then the 23 characters <code>!#$%&amp;()*+-;&lt;=&gt;?@^_`{|}~</code>. The highest possible representable address, 2<sup>128</sup>−1&nbsp;= 74×85<sup>19</sup>&nbsp;+ 53×85<sup>18</sup>&nbsp;+ 5×85<sup>17</sup>&nbsp;+ ..., would be encoded as <code>=r54lj&amp;NUUO~Hi%c2ym0</code>.
The proposed character set is, in order, <code>0</code>–<code>9</code>, <code>A</code>–<code>Z</code>, <code>a</code>–<code>z</code>, and then the 23 characters <code>!#$%&amp;()*+-;&lt;=&gt;?@^_`{|}~</code>. The highest possible representable address, 2<sup>128</sup>−1&nbsp;= 74×85<sup>19</sup>&nbsp;+ 53×85<sup>18</sup>&nbsp;+ 5×85<sup>17</sup>&nbsp;+ ..., would be encoded as <code>=r54lj&amp;NUUO~Hi%c2ym0</code>.


This character set excludes the characters <code>"',./:[\]&nbsp;</code>, making it suitable for use in [[JSON]] strings (where <code>"</code> and <code>\</code> would require escaping). However, for [[SGML]]-based protocols, notably including [[XML]], string escapes may still be required (to accommodate <code>&lt;</code>, <code>&gt;</code> and <code>&amp;</code>).
This character set excludes the characters <code>"',./:[\]&nbsp;</code>, making it suitable for use in [[JSON]] strings (where <code>"</code> and <code>\</code> would require escaping). However, for SGML-based protocols, notably including XML, string escapes may still be required (to accommodate <code>&lt;</code>, <code>&gt;</code> and <code>&amp;</code>).


==See also==
==See also==
Line 384: Line 403:


==References==
==References==
{{Reflist}}
<references />


==External links==
==External links==
*[http://base91.sourceforge.net/ BasE91]
*[http://base91.sourceforge.net/ basE91]
*[https://www.adobe.com/products/postscript/pdfs/PLRM.pdf PostScript Language Reference] (Adobe) - see ASCII85Encode Filter
*[https://web.archive.org/web/20161222092741/https://www.adobe.com/products/postscript/pdfs/PLRM.pdf PostScript Language Reference] (Adobe) - see ASCII85Encode Filter


{{Data Exchange}}
{{Data Exchange}}

Revision as of 00:21, 4 July 2024

Ascii85, also called Base85, is a form of binary-to-text encoding developed by Paul E. Rutter for the btoa utility. By using five ASCII characters to represent four bytes of binary data (making the encoded size 1/4 larger than the original, assuming eight bits per ASCII character), it is more efficient than uuencode or Base64, which use four characters to represent three bytes of data (a 1/3 increase, assuming eight bits per ASCII character).

Its main modern uses are in Adobe's PostScript and Portable Document Format file formats, as well as in the patch encoding for binary files used by Git.[1]

Overview

The basic need for a binary-to-text encoding comes from a need to communicate arbitrary binary data over preexisting communications protocols that were designed to carry only English language human-readable text. Those communication protocols may only be 7-bit safe (and within that avoid certain ASCII control codes), and may require line breaks at certain maximum intervals, and may not maintain whitespace. Thus, only the 94 printable ASCII characters are "safe" to use to convey data.

Eighty-five is the minimum integer value of n such that n⁵ ≥ 256⁴; so any sequence of 4 bytes can be encoded as 5 symbols, as long as at least 85 distinct symbols are available. (Five radix-85 digits can represent the integers from 0 to 4,437,053,124 inclusive, which suffice to represent all 4,294,967,296 possible 4-byte sequences.)
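The arithmetic behind this choice can be checked directly. The following is a minimal sketch in Python (the variable names are illustrative only) that searches for the smallest base whose five-digit numbers cover every 4-byte value:

```python
# Number of distinct 4-byte sequences: 256**4 == 2**32.
target = 256 ** 4

# Search upward for the smallest base n with n**5 >= target.
n = 1
while n ** 5 < target:
    n += 1

print(n)                 # smallest usable base
print(84 ** 5, 85 ** 5)  # 84 falls short of the target, 85 exceeds it
```

Base 84 provides only 4,182,119,424 five-digit values, fewer than the 4,294,967,296 required, which is why 85 is the smallest base that works.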

Encoding

When encoding, each group of 4 bytes is taken as a 32-bit binary number, most significant byte first (Ascii85 uses a big-endian convention). This is converted, by repeatedly dividing by 85 and taking the remainder, into 5 radix-85 digits. Then each digit (again, most significant first) is encoded as an ASCII printable character by adding 33 to it, giving the ASCII characters 33 (!) through 117 (u).
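The procedure for a single group can be sketched as follows. This is an illustrative Python function, not a reference implementation; the name encode_group is an assumption, and the z abbreviation is deliberately left out here.

```python
def encode_group(group: bytes) -> str:
    """Encode exactly 4 bytes as 5 Ascii85 characters (no 'z' shortcut)."""
    if len(group) != 4:
        raise ValueError("expected exactly 4 bytes")
    # Interpret the bytes as a 32-bit number, most significant byte first.
    value = int.from_bytes(group, "big")
    digits = []
    for _ in range(5):
        # Repeated division yields the least significant digit first.
        value, remainder = divmod(value, 85)
        digits.append(remainder)
    # Emit the most significant digit first, adding 33 to each digit.
    return "".join(chr(d + 33) for d in reversed(digits))


print(encode_group(b"Man "))
print(encode_group(b"sure"))
```

The two sample inputs are the first and last complete groups of the worked example later in the article.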

Because all-zero data is quite common, an exception is made for the sake of data compression, and an all-zero group is encoded as a single character z instead of !!!!!.

Groups of characters that decode to a value greater than 2³² − 1 (encoded as s8W-!) will cause a decoding error, as will z characters in the middle of a group. White space between the characters is ignored and may occur anywhere to accommodate line-length limitations.
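These decoding rules can be illustrated with a short Python sketch. It assumes the input consists of complete 5-character groups only (truncated final groups are a later Adobe refinement), and the function name decode_groups is hypothetical.

```python
def decode_groups(text: str) -> bytes:
    """Decode Ascii85 text made of complete groups, honouring 'z'."""
    out = bytearray()
    group = []  # base-85 digits collected for the current group
    for ch in text:
        if ch.isspace():
            continue  # whitespace may appear anywhere and is ignored
        if ch == "z":
            if group:
                raise ValueError("'z' in the middle of a group")
            out += b"\x00\x00\x00\x00"
            continue
        if not "!" <= ch <= "u":
            raise ValueError(f"invalid character {ch!r}")
        group.append(ord(ch) - 33)
        if len(group) == 5:
            value = 0
            for digit in group:
                value = value * 85 + digit
            if value > 0xFFFFFFFF:
                raise ValueError("group decodes to more than 2**32 - 1")
            out += value.to_bytes(4, "big")
            group = []
    if group:
        raise ValueError("incomplete final group")
    return bytes(out)


print(decode_groups("9jq o^\nz"))
```

The group s8W-! decodes to exactly 2³² − 1, so the next group in sequence, s8W-", is the smallest one this sketch rejects as out of range.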

Limitations

The original specification only allows a stream that is a multiple of 4 bytes to be encoded.

Encoded data may contain characters that have special meaning in many programming languages and in some text-based protocols, such as left-angle-bracket <, backslash \, and the single and double quotes ' & ". Other base-85 encodings like Z85 and RFC 1924 are designed to be safe in source code.[2]

History

btoa version

The original btoa program always encoded full groups (padding the source as necessary), with a prefix line of "xbtoa Begin", and suffix line of "xbtoa End", followed by the original file length (in decimal and hexadecimal) and three 32-bit checksums. The decoder needs to use the file length to see how much of the group was padding. The initial proposal for btoa encoding used an encoding alphabet starting at the ASCII space character through "t" inclusive, but this was replaced with an encoding alphabet of "!" to "u" to avoid "problems with some mailers (stripping off trailing blanks)".[3] This program also introduced the special "z" short form for an all-zero group. Version 4.2 added a "y" exception for a group of all ASCII space characters (0x20202020).
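As an illustration of the y exception, Python's standard base64 module exposes a foldspaces option on its Ascii85 functions that, according to its documentation, follows the btoa convention. The sketch below only demonstrates that option; it does not reproduce btoa's "xbtoa Begin"/"xbtoa End" framing, length line, or checksums.

```python
import base64

four_spaces = b"    "  # the 32-bit group 0x20202020

# Without folding, four spaces encode as an ordinary 5-character group.
plain = base64.a85encode(four_spaces)

# With foldspaces=True the group is abbreviated to the single character 'y'.
folded = base64.a85encode(four_spaces, foldspaces=True)

print(plain, folded)
print(base64.a85decode(folded, foldspaces=True) == four_spaces)
```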

ZMODEM version

"ZMODEM Pack-7 encoding" encodes groups of 4 octets into groups of 5 printable ASCII characters in a way similar, or possibly identical, to Ascii85. When a ZMODEM program sends pre-compressed 8-bit data files over 7-bit data channels, it uses "ZMODEM Pack-7 encoding".[4]

Adobe version

Adobe adopted the basic btoa encoding, but with slight changes, and gave it the name Ascii85. The characters used are the ASCII characters 33 (!) through 117 (u) inclusive (to represent the base-85 digits 0 through 84), together with the letter z (as a special case to represent a 32-bit 0 value), and white space is ignored. Adobe uses the delimiter "~>" to mark the end of an Ascii85-encoded string and represents the length by truncating the final group: If the last block of source bytes contains fewer than 4 bytes, the block is padded with up to 3 null bytes before encoding. After encoding, as many bytes as were added as padding are removed from the end of the output.

The reverse is applied when decoding: the last block is padded to 5 characters with the Ascii85 character u, and as many bytes as were added as padding are omitted from the end of the output (see example).
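The truncation scheme can be sketched in Python as follows. This is an illustrative implementation under the rules described above, not Adobe's code; the function names adobe_encode and adobe_decode are assumptions, and only the "~>" end marker mentioned in the text is handled.

```python
def adobe_encode(data: bytes) -> str:
    """Encode bytes with a truncated final group and a '~>' terminator."""
    out = []
    for i in range(0, len(data), 4):
        chunk = data[i:i + 4]
        pad = 4 - len(chunk)  # 0 to 3 missing bytes in the last block
        value = int.from_bytes(chunk + b"\x00" * pad, "big")
        if value == 0 and pad == 0:
            out.append("z")  # abbreviation for a full all-zero group
            continue
        chars = []
        for _ in range(5):
            value, digit = divmod(value, 85)
            chars.append(chr(digit + 33))
        group = "".join(reversed(chars))
        out.append(group[:5 - pad])  # drop one character per padding byte
    return "".join(out) + "~>"


def adobe_decode(text: str) -> bytes:
    """Decode the output of adobe_encode, ignoring whitespace."""
    body = "".join(text.split())
    if body.endswith("~>"):
        body = body[:-2]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] == "z":
            out += b"\x00" * 4
            i += 1
            continue
        chunk = body[i:i + 5]
        i += 5
        pad = 5 - len(chunk)  # characters missing from the last group
        if pad == 4 or any(not "!" <= ch <= "u" for ch in chunk):
            raise ValueError("malformed Ascii85 group")
        value = 0
        for ch in chunk + "u" * pad:  # pad the short group with 'u'
            value = value * 85 + (ord(ch) - 33)
        if value > 0xFFFFFFFF:
            raise ValueError("group exceeds 2**32 - 1")
        out += value.to_bytes(4, "big")[:4 - pad]  # drop the padding bytes
    return bytes(out)


print(adobe_encode(b"sure."))
print(adobe_decode("F*2M7 /c~>"))
```

Because the number of dropped characters equals the number of padding bytes, the decoder can infer the original length from the length of the final group alone.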

The padding is not arbitrary. Converting from binary to Base64 only regroups bits and does not change them or their order (a high bit in the binary does not affect the low bits in the Base64 representation). In converting a binary number to base 85 (85 is not a power of two), high bits do affect the low-order base-85 digits and vice versa. Padding the binary low (with zero bits) while encoding and padding the base-85 value high (with u characters) while decoding ensures that the high-order bits are preserved: the zero padding in the binary leaves enough room that the small addition is absorbed and no carry reaches the high bits.
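A small numeric check makes the point concrete. The sketch below (the helper group_value is hypothetical) decodes the truncated group /c, which encodes a single period, once padded with u and once padded with the lowest digit !:

```python
def group_value(chars: str) -> int:
    """Interpret 5 Ascii85 characters as a base-85 number."""
    value = 0
    for ch in chars:
        value = value * 85 + (ord(ch) - 33)
    return value


truncated = "/c"  # the period "." after encoding and truncation

high = group_value(truncated + "uuu").to_bytes(4, "big")
low = group_value(truncated + "!!!").to_bytes(4, "big")

print(high[:1])  # first byte when padding with 'u'
print(low[:1])   # first byte when padding with '!'
```

Padding with u recovers the original byte 46 (the period), while padding with ! rounds the value down far enough that the leading byte becomes 45 (a hyphen), so the original data is lost.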

In Ascii85-encoded blocks, whitespace and line-break characters may be present anywhere, including in the middle of a 5-character block, but they must be silently ignored.

Adobe's specification does not support the y exception.

Example for Ascii85

A quote from Thomas Hobbes's Leviathan:

Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.

If this is initially encoded using US-ASCII, it can be reencoded in Ascii85 as follows:

9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,O<
DJ+*.@<*K0@<6L(Df-\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKYi(
DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIal(
DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs`8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G>u
D.RTpAKYo'+CT/5+Cei#DII?(E,9)oF*2M7/c
Text content M a n [space] ...
ASCII 77 97 110 32 ...
Bit pattern 0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 0 0 0 0 0 ...
32-bit Value 1,298,230,816 = 24×85⁴ + 73×85³ + 80×85² + 78×85 + 61 ...
Base 85 (+33) 24 (57) 73 (106) 80 (113) 78 (111) 61 (94) ...
ASCII 9 j q o ^ ...

Text content s u r e
ASCII 115 117 114 101
Bit pattern 0 1 1 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 1 0 0 1 0 0 1 1 0 0 1 0 1
32-bit Value 1,937,076,837 = 37×85⁴ + 9×85³ + 17×85² + 44×85 + 22
Base 85 (+33) 37 (70) 9 (42) 17 (50) 44 (77) 22 (55)
ASCII F * 2 M 7


Since the last 4-tuple is incomplete, it must be padded with three zero bytes:

Text content . \0 \0 \0
ASCII 46 0 0 0
Bit pattern 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
32-bit Value 771,751,936 = 14×85⁴ + 66×85³ + 56×85² + 74×85 + 46
Base 85 (+33) 14 (47) 66 (99) 56 (89) 74 (107) 46 (79)
ASCII / c Y k O

Since three bytes of padding had to be added, the three final characters 'YkO' are omitted from the output.

Decoding is done inversely, except that the last 5-tuple is padded with 'u' characters:

ASCII / c u u u
Base 85 (+33) 14 (47) 66 (99) 84 (117) 84 (117) 84 (117)
32-bit Value 771,955,124 = 14×85⁴ + 66×85³ + 84×85² + 84×85 + 84
Bit pattern 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 1 1 0 1 0 0
ASCII 46 3 25 180
Text content . [ ETX ] [ EM ] ´ (Extended ASCII)

Since the input had to be padded with three 'u' bytes, the last three bytes of the output are ignored and we end up with the original period.

The input sentence does not contain 4 consecutive zero bytes, so the example does not show the use of the 'z' abbreviation.
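The individual groups worked through above can be cross-checked against an existing implementation. The following sketch uses the a85encode and a85decode functions from Python's standard base64 module, which, by default, truncate and pad the final group in the way described here:

```python
import base64

# First and last complete groups of the example, plus the final period.
first = base64.a85encode(b"Man ")
tail = base64.a85encode(b"sure.")

print(first)
print(tail)

# Decoding the truncated final group recovers the single period.
print(base64.a85decode(b"/c"))

# adobe=True additionally frames the output with <~ and ~>.
print(base64.a85encode(b"sure.", adobe=True))
```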

Compatibility

The Ascii85 encoding is compatible with 7-bit and 8-bit MIME, while having less overhead than Base64.

One potential compatibility issue of Ascii85 is that some of the characters it uses are significant in markup languages such as XML or SGML. To include Ascii85 data in these documents, it may be necessary to escape the quotes, angle brackets, and ampersands.
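One way to perform such escaping, shown here only as a sketch, is the escape function in Python's standard html module, which replaces the ampersand, angle brackets, and both kinds of quote with character references. The sample string is assembled from fragments of the example output above and is not itself a meaningful encoding.

```python
import html

# Fragments of Ascii85 output containing <, >, " and &.
sample = 'F<GL>Cj@.4Gp$d7F"AGXF*&OC'

escaped = html.escape(sample, quote=True)
print(escaped)

# Unescaping restores the original Ascii85 text unchanged.
print(html.unescape(escaped) == sample)
```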

RFC 1924 version

Published on April 1, 1996, informational RFC 1924: "A Compact Representation of IPv6 Addresses" by Robert Elz suggests a base-85 encoding of IPv6 addresses as an April Fools' Day joke. This differs from the scheme used above in that he proposes a different set of 85 ASCII characters, and proposes to do all arithmetic on the 128-bit number, converting it to a single 20-digit base-85 number (internal whitespace not allowed), rather than breaking it into four 32-bit groups.

The proposed character set is, in order, 0–9, A–Z, a–z, and then the 23 characters !#$%&()*+-;<=>?@^_`{|}~. The highest possible representable address, 2¹²⁸ − 1 = 74×85¹⁹ + 53×85¹⁸ + 5×85¹⁷ + ..., would be encoded as =r54lj&NUUO~Hi%c2ym0.
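The scheme can be sketched in a few lines of Python using arbitrary-precision integers. The function names rfc1924_encode and rfc1924_decode are hypothetical, and the standard ipaddress module is used only to convert between textual addresses and 128-bit integers.

```python
import ipaddress

# The 85-character alphabet in the order given above.
ALPHABET = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "!#$%&()*+-;<=>?@^_`{|}~"
)


def rfc1924_encode(address: str) -> str:
    """Encode an IPv6 address as a single 20-digit base-85 number."""
    value = int(ipaddress.IPv6Address(address))
    digits = []
    for _ in range(20):
        value, remainder = divmod(value, 85)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))  # most significant digit first


def rfc1924_decode(text: str) -> str:
    """Decode 20 base-85 digits back to a textual IPv6 address."""
    value = 0
    for ch in text:
        value = value * 85 + ALPHABET.index(ch)
    return str(ipaddress.IPv6Address(value))


encoded = rfc1924_encode("ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff")
print(encoded)
print(rfc1924_decode(encoded))
```

Unlike Ascii85, the arithmetic here is carried out on the whole 128-bit value at once, so the 20 digits cannot be decoded independently in groups of five.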

This character set excludes the characters "',./:[\], making it suitable for use in JSON strings (where " and \ would require escaping). However, for SGML-based protocols, notably including XML, string escapes may still be required (to accommodate <, > and &).

See also

References

  1. ^ Hamano, Junio C (May 5, 2006). "[PATCH] binary patch". git. Archived from the original on 2020-07-26.
  2. ^ "32/Z85" on ZeroMQ RFC
  3. ^ Orost, Joe (Mar 26, 1991). "Re: COMPRESSING of binary data into mailable ASCII Re: Encoding of binary data into mailable ASCII". Google Groups. Retrieved 11 April 2015.
  4. ^ Chuck Forsberg. "Recent Developments in ZMODEM". omen.com. Archived from the original on 2015-09-24. Retrieved 2013-05-14. "ZMODEM Pack-7 packs 4 bytes into 5 printing characters."