
Percent-encoding: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Implement the actual intention of revision 886687341 that introduced this column: "Two currency symbols of major currencies added in order to illustrate 2 and 3 byte UTF-8 encoding"
No edit summary
Tags: Visual edit Mobile edit Mobile web edit
 
(28 intermediate revisions by 21 users not shown)
Line 1: Line 1:
{{Short description|Method of encoding characters in a URI}}
{{Short description|Method of encoding characters in a URI}}
{{Self reference|For the urlencode in MediaWiki, see [[mw:Help:Magic words#URL data|Help:Magic words]]}}
{{Self reference|For links within Wikipedia needing percent-encoding, see {{Section link|Help:URL|Fixing links with unsupported characters}}}}


'''URL encoding''', officially known as '''percent-encoding''', is a method to [[binary-to-text encoding|encode]] arbitrary data in a [[uniform resource identifier]] (URI) using only the [[ASCII|US-ASCII]] characters legal within a URI. Although it is known as ''URL encoding'', it is also used more generally within the main [[Uniform Resource Identifier]] (URI) set, which includes both [[Uniform Resource Locator]] (URL) and [[Uniform Resource Name]] (URN). As such, it is also used in the preparation of data of the <code>application/x-www-form-urlencoded</code> [[media type]], as is often used in the submission of [[HTML]] [[form (web)|form]] data in [[HTTP]] requests.
'''URL encoding''', officially known as '''percent-encoding''', is a method to [[binary-to-text encoding|encode]] arbitrary data in a [[uniform resource identifier]] (URI) using only the [[ASCII|US-ASCII]] characters legal within a URI. Although it is known as ''URL encoding'', it is also used more generally within the main [[Uniform Resource Identifier]] (URI) set, which includes both [[Uniform Resource Locator]] (URL) and [[Uniform Resource Name]] (URN). Consequently, it is also used in the preparation of data of the <code>application/x-www-form-urlencoded</code> [[media type]], as is often used in the submission of HTML [[form (web)|form]] data in [[HTTP]] requests.


== Percent-encoding in a URI ==
== Types ==
=== Percent-encoding in a URI ===


=== Types of URI characters ===
==== Types of URI characters ====
The characters allowed in a URI are either ''reserved'' or ''unreserved'' (or a [[percent sign|percent character]] as part of a percent-encoding). ''Reserved'' characters are those characters that sometimes have special meaning. For example, [[forward slash]] characters are used to separate different parts of a URL (or more generally, a URI). ''Unreserved'' characters have no such meanings. Using percent-encoding, reserved characters are represented using special character sequences. The sets of reserved and unreserved characters and the circumstances under which certain reserved characters have special meaning have changed slightly with each revision of specifications that govern URIs and URI schemes.
The characters allowed in a URI are either ''reserved'' or ''unreserved'' (or a [[percent sign|percent character]] as part of a percent-encoding). ''Reserved'' characters are those characters that sometimes have special meaning. For example, [[forward slash]] characters are used to separate different parts of a URL (or, more generally, a URI). ''Unreserved'' characters have no such meanings. Using percent-encoding, reserved characters are represented using special character sequences. The sets of reserved and unreserved characters and the circumstances under which certain reserved characters have special meaning have changed slightly with each revision of specifications that govern URIs and URI schemes.


{| class="wikitable"
{| cellpadding="6px" border=1 border="1px solid #C0C0C0" class="wikitable" style="border-collapse:collapse; background-color:white"
|+RFC 3986 section 2.2 ''Reserved Characters'' (January 2005)
|+RFC 3986 section 2.2 ''Reserved Characters'' (January 2005)
|-
|-
Line 16: Line 15:
|}
|}


{| class="wikitable"
{| cellpadding="6px" border=1 border="1px solid #C0C0C0" class="wikitable" style="border-collapse:collapse; background-color:white"
|+RFC 3986 section 2.3 ''Unreserved Characters'' (January 2005)
|+RFC 3986 section 2.3 ''Unreserved Characters'' (January 2005)
|-
|-
Line 24: Line 23:
|-
|-
| <code>[[0 (number)|0]]</code> || <code>[[1 (number)|1]]</code> || <code>[[2 (number)|2]]</code> || <code>[[3 (number)|3]]</code> || <code>[[4 (number)|4]]</code> || <code>[[5 (number)|5]]</code> || <code>[[6 (number)|6]]</code> || <code>[[7 (number)|7]]</code> || <code>[[8 (number)|8]]</code> || <code>[[9 (number)|9]]</code>
| <code>[[0 (number)|0]]</code> || <code>[[1 (number)|1]]</code> || <code>[[2 (number)|2]]</code> || <code>[[3 (number)|3]]</code> || <code>[[4 (number)|4]]</code> || <code>[[5 (number)|5]]</code> || <code>[[6 (number)|6]]</code> || <code>[[7 (number)|7]]</code> || <code>[[8 (number)|8]]</code> || <code>[[9 (number)|9]]</code>
| <code>[[hyphen-minus|-]]</code> || <code>[[underscore|_]]</code> || <code>[[full stop|.]]</code> || <code>[[tilde|~]]</code> || colspan="13" | <!--empty-->
| <code>[[hyphen-minus|-]]</code> || <code>[[full stop|.]]</code> || <code>[[underscore|_]]</code> || <code>[[tilde|~]]</code> || colspan="13" | <!--empty-->
|}
|}


Line 31: Line 30:
=== Reserved characters<span class="anchor" id="Percent-encoding reserved characters"></span> ===
=== Reserved characters<span class="anchor" id="Percent-encoding reserved characters"></span> ===
When a character from the reserved set (a "reserved character") has a special meaning (a "reserved purpose") in a certain context, and a URI scheme says that it is necessary to use that character for some ''other'' purpose, then the character must be ''percent-encoded''. Percent-encoding a reserved character involves converting the character to its corresponding byte value in [[American Standard Code for Information Interchange|ASCII]] and then representing that value as a pair of [[hexadecimal]] digits (if there is a single hex digit, a [[leading zero]] is added). The digits, preceded by a [[percent sign]] (<code>%</code>) as an [[escape character]], are then used in the URI in place of the reserved character.
When a character from the reserved set (a "reserved character") has a special meaning (a "reserved purpose") in a certain context, and a URI scheme says that it is necessary to use that character for some ''other'' purpose, then the character must be ''percent-encoded''. Percent-encoding a reserved character involves converting the character to its corresponding byte value in [[American Standard Code for Information Interchange|ASCII]] and then representing that value as a pair of [[hexadecimal]] digits (if there is a single hex digit, a [[leading zero]] is added). The digits, preceded by a [[percent sign]] (<code>%</code>) as an [[escape character]], are then used in the URI in place of the reserved character.
(For a non-ASCII character, it is typically converted to its byte sequence in [[UTF-8]], and then each byte value is represented as above.)
(A non-ASCII character is typically converted to its byte sequence in [[UTF-8]], and then each byte value is represented as above.)


The reserved character <code>/</code>, for example, if used in the "path" component of a [[URI]], has the special meaning of being a [[Slash (punctuation)#Networking|delimiter]] ''between'' path segments. If, according to a given URI scheme, <code>/</code> needs to be ''in'' a path segment, then the three characters <code>%2F</code> or <code>%2f</code> must be used in the segment instead of a raw <code>/</code>.
The reserved character <code>/</code>, for example, if used in the "path" component of a [[URI]], has the special meaning of being a [[Slash (punctuation)#Networking|delimiter]] ''between'' path segments. If, according to a given URI scheme, <code>/</code> needs to be ''in'' a path segment, then the three characters <code>%2F</code> or <code>%2f</code> must be used in the segment instead of a raw <code>/</code>.


{| class="wikitable"
{| cellpadding="2px" border=1 border="1px solid #C0C0C0" class="wikitable" style="border-collapse:collapse; background-color:white"
|+ Reserved characters after percent-encoding
|+ Reserved characters after percent-encoding
|- align="center"
|- align="center"
| <code>[[Space_(punctuation)|␣]]</code> || <code>[[exclamation mark|!]]</code> || <code>[[double_quote|"]]</code> || <code>[[number sign|#]]</code> || <code>[[dollar sign|$]]</code> || <code>[[Percent sign|%]]</code> || <code>[[ampersand|&]]</code> || <code>[[apostrophe (mark)|']]</code> || <code>[[parenthesis|(]]</code> || <code>[[parenthesis|)]]</code> || <code>[[asterisk|<nowiki>*</nowiki>]]</code> || <code>[[plus sign|+]]</code> || <code>[[Comma|,]]</code> || <code>[[slash (punctuation)|/]]</code> || <code>[[colon (punctuation)|:]]</code> || <code>[[semicolon|;]]</code> || <code>[[equal sign|=]]</code> || <code>[[question mark|?]]</code> || <code>[[At sign|@]]</code> || <code>[[square_bracket|[]]</code> || <code>[[square_bracket|<nowiki>]</nowiki>]]</code>
| <code>[[exclamation mark|!]]</code> || <code>[[number sign|#]]</code> || <code>[[dollar sign|$]]</code> || <code>[[ampersand|&]]</code> || <code>[[apostrophe (mark)|']]</code> || <code>[[parenthesis|(]]</code> || <code>[[parenthesis|)]]</code> || <code>[[asterisk|<nowiki>*</nowiki>]]</code> || <code>[[plus sign|+]]</code> || <code>[[Comma|,]]</code> || <code>[[slash (punctuation)|/]]</code> || <code>[[colon (punctuation)|:]]</code> || <code>[[semicolon|;]]</code> || <code>[[equal sign|=]]</code> || <code>[[question mark|?]]</code> || <code>[[At sign|@]]</code> || <code>[[square_bracket|[]]</code> || <code>[[square_bracket|<nowiki>]</nowiki>]]</code>
|-
|-
| <code>%20</code> || <code>%21</code> || <code>%22</code> || <code>%23</code> || <code>%24</code> || <code>%25</code> || <code>%26</code> || <code>%27</code> || <code>%28</code> || <code>%29</code> || <code>%2A</code> || <code>%2B</code> || <code>%2C</code> || <code>%2F</code> || <code>%3A</code> || <code>%3B</code> || <code>%3D</code>|| <code>%3F</code> || <code>%40</code> || <code>%5B</code> || <code>%5D</code>
| <code>%21</code> || <code>%23</code> || <code>%24</code> || <code>%26</code> || <code>%27</code> || <code>%28</code> || <code>%29</code> || <code>%2A</code> || <code>%2B</code> || <code>%2C</code> || <code>%2F</code> || <code>%3A</code> || <code>%3B</code> || <code>%3D</code>|| <code>%3F</code> || <code>%40</code> || <code>%5B</code> || <code>%5D</code>
|}
|}


Line 54: Line 53:
URIs that differ only by whether an unreserved character is percent-encoded or appears literally are equivalent by definition, but URI processors, in practice, may not always recognize this equivalence. For example, URI consumers ''should not'' treat <code>%41</code> differently from <code>A</code> or <code>%7E</code> differently from <code>~</code>, but some do. For maximal interoperability, URI producers are discouraged from percent-encoding unreserved characters.
URIs that differ only by whether an unreserved character is percent-encoded or appears literally are equivalent by definition, but URI processors, in practice, may not always recognize this equivalence. For example, URI consumers ''should not'' treat <code>%41</code> differently from <code>A</code> or <code>%7E</code> differently from <code>~</code>, but some do. For maximal interoperability, URI producers are discouraged from percent-encoding unreserved characters.


=== Percent character<span class="anchor" id="Percent-encoding the percent character"></span> ===
==== Percent character<span class="anchor" id="Percent-encoding the percent character"></span> ====
Because the percent character ( <code>%</code> ) serves as the indicator for percent-encoded octets, it must be percent-encoded as <code>%25</code> for that octet to be used as data within a URI.
Because the percent character ( <code>%</code> ) serves to indicate percent-encoded octets, it must itself be percent-encoded as <code>%25</code> to be used as data within a URI.


=== Arbitrary data<span class="anchor" id="Percent-encoding arbitrary data"></span> ===
==== Arbitrary data<span class="anchor" id="Percent-encoding arbitrary data"></span> ====
Most URI schemes involve the representation of arbitrary data, such as an [[IP address]] or [[file system]] path, as components of a URI. URI scheme specifications should, but often do not, provide an explicit mapping between URI characters and all possible data values being represented by those characters.
Most URI schemes involve the representation of arbitrary data, such as an [[IP address]] or [[file system]] path, as components of a URI. URI scheme specifications should, but often do not, provide an explicit mapping between URI characters and all possible data values being represented by those characters.


==== Binary data ====
===== Binary data =====
Since the publication of RFC 1738 in 1994 it has been specified that schemes that provide for the representation of [[binary data]] in a URI must divide the data into 8-bit bytes and percent-encode each byte in the same manner as above.<ref>RFC 1738 §2.2; RFC 2396 §2.4; RFC 3986 §1.2.1, 2.1, 2.5.</ref> Byte value 0x0F, for example, should be represented by <code>%0F</code>, but byte value 0x41 can be represented by <code>A</code>, or <code>%41</code>. The use of unencoded characters for alphanumeric and other unreserved characters is typically preferred, as it results in shorter URLs.
Since the publication of RFC 1738 in 1994 it has been specified that schemes that provide for the representation of [[binary data]] in a URI must divide the data into 8-bit bytes and percent-encode each byte in the same manner as above.<ref>RFC 1738 §2.2; RFC 2396 §2.4; RFC 3986 §1.2.1, 2.1, 2.5.</ref> Byte value 0x0F, for example, should be represented by <code>%0F</code>, but byte value 0x41 can be represented by <code>A</code>, or <code>%41</code>. The use of unencoded characters for alphanumeric and other unreserved characters is typically preferred, as it results in shorter URLs.


==== Character data ====
===== Character data =====
The procedure for percent-encoding binary data has often been extrapolated, sometimes inappropriately or without being fully specified, to apply to character-based data. In the [[World Wide Web]]'s formative years, when dealing with data characters in the ASCII repertoire and using their corresponding bytes in ASCII as the basis for determining percent-encoded sequences, this practice was relatively harmless; it was just assumed that characters and bytes mapped one-to-one and were interchangeable. The need to represent characters outside the ASCII range, however, grew quickly, and URI schemes and protocols often failed to provide standard rules for preparing character data for inclusion in a URI. Web applications consequently began using different multi-byte, [[state (computer science)|stateful]], and other non-ASCII-compatible encodings as the basis for percent-encoding, leading to ambiguities and difficulty interpreting URIs reliably.
The procedure for percent-encoding binary data has often been extrapolated, sometimes inappropriately or without being fully specified, to apply to character-based data. In the [[World Wide Web]]'s formative years, when dealing with data characters in the ASCII repertoire and using their corresponding bytes in ASCII as the basis for determining percent-encoded sequences, this practice was relatively harmless; it was just assumed that characters and bytes mapped one-to-one and were interchangeable. The need to represent characters outside the ASCII range, however, grew quickly, and URI schemes and protocols often failed to provide standard rules for preparing character data for inclusion in a URI. Web applications consequently began using different multi-byte, [[state (computer science)|stateful]], and other non-ASCII-compatible encodings as the basis for percent-encoding, leading to ambiguities and difficulty interpreting URIs reliably.


For example, many URI schemes and protocols based on RFCs 1738 and 2396 presume that the data characters will be converted to bytes according to some unspecified [[character encoding]] before being represented in a URI by unreserved characters or percent-encoded bytes. If the scheme does not allow the URI to provide a hint as to what encoding was used, or if the encoding conflicts with the use of ASCII to percent-encode reserved and unreserved characters, then the URI cannot be reliably interpreted. Some schemes fail to account for encoding at all and instead just suggest that data characters map directly to URI characters, which leaves it up to implementations to decide whether and how to percent-encode data characters that are in neither the reserved nor unreserved sets.
For example, many URI schemes and protocols based on RFCs 1738 and 2396 presume that the data characters will be converted to bytes according to some unspecified [[character encoding]] before being represented in a URI by unreserved characters or percent-encoded bytes. If the scheme does not allow the URI to provide a hint as to what encoding was used, or if the encoding conflicts with the use of ASCII to percent-encode reserved and unreserved characters, then the URI cannot be reliably interpreted. Some schemes fail to account for encoding at all and instead just suggest that data characters map directly to URI characters, which leaves it up to implementations to decide whether and how to percent-encode data characters that are in neither the reserved nor unreserved sets.


{| class="wikitable"
{| cellpadding="2px" border=1 border="1px solid #C0C0C0" class="wikitable" style="border-collapse:collapse; background-color:white"
|+ Common characters after percent-encoding (ASCII or UTF-8 based)
|+ Common characters after percent-encoding (ASCII or UTF-8 based)
|- align="center"
|- align="center"
|<code>[[Space (punctuation)|␣]]</code>
| <code>[[newline]]</code> || <code>[[space (punctuation)|space]]</code> || <code>[[double quote|"]]</code> || <code>[[Percent sign|%]]</code> || <code>[[hyphen|-]]</code> || <code>[[full stop|.]]</code> || <code>[[angle bracket|<]]</code> || <code>[[angle bracket|>]]</code> || <code>[[back slash|\]]</code> || <code>[[caret|^]]</code> || <code>[[underscore|_]]</code> || <code>[[grave accent|`]]</code> || <code>[[curly bracket|{]]</code> || <code>[[vertical bar|<nowiki>|</nowiki>]]</code> || <code>[[curly bracket|}]]</code> || <code>[[tilde|~]]</code> || <code>[[£]]</code> || <code>[[€]]</code>
|<code>[[Double quote|"]]</code>
|<code>[[Percent sign|%]]</code>
| <code>[[hyphen|-]]</code> || <code>[[full stop|.]]</code> || <code>[[angle bracket|<]]</code> || <code>[[angle bracket|>]]</code> || <code>[[back slash|\]]</code> || <code>[[caret|^]]</code> || <code>[[underscore|_]]</code> || <code>[[grave accent|`]]</code> || <code>[[curly bracket|{]]</code> || <code>[[vertical bar|<nowiki>|</nowiki>]]</code> || <code>[[curly bracket|}]]</code> || <code>[[tilde|~]]</code> || <code>[[£]]</code> || <code>[[€]]</code>
|- align="center" valign="top"
|- align="center" valign="top"
|<code>%20</code>
| <code>%0A</code> ''or'' <code>%0D</code> ''or'' <code>%0D%0A</code> || <code>%20</code> || <code>%22</code> || <code>%25</code> || <code>%2D</code> || <code>%2E</code> || <code>%3C</code> || <code>%3E</code> || <code>%5C</code> || <code>%5E</code> || <code>%5F</code> || <code>%60</code> || <code>%7B</code> || <code>%7C</code> || <code>%7D</code> || <code>%7E</code> || <code>%C2%A3</code> || <code>%E2%82%AC</code>
|<code>%22</code>
|<code>%25</code>
| <code>%2D</code> || <code>%2E</code> || <code>%3C</code> || <code>%3E</code> || <code>%5C</code> || <code>%5E</code> || <code>%5F</code> || <code>%60</code> || <code>%7B</code> || <code>%7C</code> || <code>%7D</code> || <code>%7E</code> || <code>%C2%A3</code> || <code>%E2%82%AC</code>
|}
|}


Arbitrary character data is sometimes percent-encoded and used in non-URI situations, such as for password-obfuscation programs or other system-specific translation protocols.
Arbitrary character data is sometimes percent-encoded and used in non-URI situations, such as for password-obfuscation programs or other system-specific translation protocols.


=== Current standard ===
==== Current standard ====
{{main article|Internationalized Resource Identifier}}
{{main article|Internationalized Resource Identifier}}
The generic URI syntax recommends that new URI schemes that provide for the representation of character data in a URI should, in effect, represent characters from the unreserved set without translation and should convert all other characters to bytes according to [[UTF-8]], and then percent-encode those values. This suggestion was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
The generic URI syntax recommends that new URI schemes that provide for the representation of character data in a URI should, in effect, represent characters from the unreserved set without translation and should convert all other characters to bytes according to [[UTF-8]], and then percent-encode those values. This suggestion was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
Line 84: Line 89:
Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary or character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.
Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary or character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.


=== Non-standard implementations ===
==== Non-standard implementations ====
There exists a non-standard encoding for Unicode characters: <code>%u''xxxx''</code>, where ''xxxx'' is a [[UTF-16]] code unit represented as four hexadecimal digits. This behavior is not specified by any RFC and has been [http://www.w3.org/International/iri-edit/draft-duerst-iri.html rejected] by the W3C. The 13th edition of [[ECMA-262]] still includes an <code>escape</code> function that uses this syntax, along with <code>encodeURI</code> and <code>encodeURIComponent</code> functions, which apply [[UTF-8]] encoding to a string, then percent-escape the resulting bytes.<ref>{{cite web|url=https://www.ecma-international.org/ecma-262/8.0/index.html|title=ECMAScript 2017 Language Specification (ECMA-262, 8th edition, June 2017)|publisher=Ecma International|access-date=2018-06-20|archive-date=2018-07-02|archive-url=https://web.archive.org/web/20180702045054/http://www.ecma-international.org/ecma-262/8.0/index.html|url-status=live}}</ref>
There exists a non-standard encoding for Unicode characters: <code>%u''xxxx''</code>, where ''xxxx'' is a [[UTF-16]] code unit represented as four hexadecimal digits. This behavior is not specified by any RFC and has been [http://www.w3.org/International/iri-edit/draft-duerst-iri.html rejected] by the W3C. The 13th edition of [[ECMA-262]] still includes an <code>escape</code> function that uses this syntax, along with <code>encodeURI</code> and <code>encodeURIComponent</code> functions, which apply [[UTF-8]] encoding to a string, then percent-escape the resulting bytes.<ref>{{cite web|url=https://www.ecma-international.org/ecma-262/8.0/index.html|title=ECMAScript 2017 Language Specification (ECMA-262, 8th edition, June 2017)|publisher=Ecma International|access-date=2018-06-20|archive-date=2018-07-02|archive-url=https://web.archive.org/web/20180702045054/http://www.ecma-international.org/ecma-262/8.0/index.html|url-status=live}}</ref>


==The application/x-www-form-urlencoded type==
===The application/x-www-form-urlencoded type===
<!-- [[application/x-www-form-urlencoded]] and [[x-www-form-urlencoded]] redirect to this section -->
<!-- [[application/x-www-form-urlencoded]] and [[x-www-form-urlencoded]] redirect to this section -->
When data that has been entered into HTML [[form (web)|form]]s is submitted, the form field names and values are encoded and sent to the server in an HTTP request message using method [[Hypertext Transfer Protocol#Request methods|GET]] or [[POST (HTTP)|POST]], or, historically, via [[email]].<ref>User-agent support for email based [[HyperText Markup Language|HTML]] form submission, using a 'mailto' [[Uniform Resource Locator|URL]] as the form action, was proposed in RFC 1867 section 5.6, during the HTML 3.2 era. Various web browsers implemented it by invoking a separate email program or using their own rudimentary [[Simple Mail Transfer Protocol|SMTP]] capabilities. Although sometimes unreliable, it was briefly popular as a simple way to transmit form data without involving a web server or [[Common Gateway Interface|CGI]] scripts.</ref> The encoding used by default is based on an early version of the general URI percent-encoding rules,<ref>{{Cite journal|url=https://tools.ietf.org/html/rfc1630|title=RFC 1630|last=Berners-Lee|first=T.|date=June 1994|website=IETF Tools|publisher=IETF|access-date=29 June 2016|archive-date=21 June 2016|archive-url=https://web.archive.org/web/20160621035940/https://tools.ietf.org/html/rfc1630|url-status=live}}</ref> with a number of modifications such as [[newline]] normalization and replacing spaces with <code>+</code> instead of <code>%20</code>. The [[media type]] of data encoded this way is <code>application/x-www-form-urlencoded</code>, and it is currently defined in the HTML and [[XForms]] specifications. In addition, the [[Common Gateway Interface|CGI]] specification contains rules for how web servers decode data of this type and make it available to applications.
When data that has been entered into HTML [[form (web)|form]]s is submitted, the form field names and values are encoded and sent to the server in an HTTP request message using method [[Hypertext Transfer Protocol#Request methods|GET]] or [[POST (HTTP)|POST]], or, historically, via [[email]].<ref>User-agent support for email based [[HyperText Markup Language|HTML]] form submission, using a 'mailto' [[Uniform Resource Locator|URL]] as the form action, was proposed in RFC 1867 section 5.6, during the HTML 3.2 era. Various web browsers implemented it by invoking a separate email program or using their own rudimentary [[Simple Mail Transfer Protocol|SMTP]] capabilities. Although sometimes unreliable, it was briefly popular as a simple way to transmit form data without involving a web server or [[Common Gateway Interface|CGI]] scripts.</ref> The encoding used by default is based on an early version of the general URI percent-encoding rules,<ref>{{Cite journal|url=https://tools.ietf.org/html/rfc1630|title=RFC 1630|last=Berners-Lee|first=T.|date=June 1994|website=IETF Tools|publisher=IETF|access-date=29 June 2016|archive-date=21 June 2016|archive-url=https://web.archive.org/web/20160621035940/https://tools.ietf.org/html/rfc1630|url-status=live}}</ref> with a number of modifications such as [[newline]] normalization and replacing spaces with <code>+</code> instead of <code>%20</code>. The [[media type]] of data encoded this way is <code>application/x-www-form-urlencoded</code>, and it is currently defined in the HTML and [[XForms]] specifications. In addition, the [[Common Gateway Interface|CGI]] specification contains rules for how web servers decode data of this type and make it available to applications.
Line 95: Line 100:


== See also ==
== See also ==
{{Wikifunctions|Z10761|URI percent encode}}
{{Wikifunctions|Z10774|URI percent decode}}
* [[Internationalized Resource Identifier]]
* [[Internationalized Resource Identifier]]
* [[Punycode]]
* [[Punycode]]
Line 105: Line 112:


== External links ==
== External links ==

The following specifications all discuss and define reserved characters, unreserved characters, and percent-encoding, in some form or other:
The following specifications all discuss and define reserved characters, unreserved characters, and percent-encoding, in some form or other:
* {{IETF RFC|3986|link=no}} / [[Internet standard|STD]] 66 (plus [http://www.rfc-editor.org/errata_search.php?rfc=3986 errata]), the current generic URI syntax specification.
* {{IETF RFC|3986|link=no}} / [[Internet standard|STD]] 66 (plus [http://www.rfc-editor.org/errata_search.php?rfc=3986 errata]), the current generic URI syntax specification.
Line 114: Line 120:
* [http://www.w3.org/International/O-URL-code.html W3C explanation of UTF-8 in URIs]
* [http://www.w3.org/International/O-URL-code.html W3C explanation of UTF-8 in URIs]
* [http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1 W3C HTML form content types]
* [http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1 W3C HTML form content types]

Various implementations:
* [https://devpal.co/url-encode/ DevPal URL encoder] – online developer tools that support URL encoding.
* [https://httptools.dev/url-encoder-decoder Online URL encoder and decoder] – encodes or decodes URLs within the browser.
* [https://urlencoderonline.toolerbar.com/?m=1 URL Encoder online] – a website with various options to convert files or texts into URL-encoded format.
* [https://www.urlencoder.org/ URL Encode and Decode - Online] – a website with various options to convert files or texts into URL-encoded format.


[[Category:URI schemes]]
[[Category:URI schemes]]

Latest revision as of 21:42, 1 November 2024

URL encoding, officially known as percent-encoding, is a method to encode arbitrary data in a uniform resource identifier (URI) using only the US-ASCII characters legal within a URI. Although it is known as URL encoding, it is used more generally within the URI set, which includes both Uniform Resource Locators (URLs) and Uniform Resource Names (URNs). Consequently, it is also used to prepare data of the application/x-www-form-urlencoded media type, which is often used to submit HTML form data in HTTP requests.

Types


Percent-encoding in a URI


Types of URI characters


The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding). Reserved characters are those characters that sometimes have special meaning. For example, forward slash characters are used to separate different parts of a URL (or, more generally, a URI). Unreserved characters have no such meanings. Using percent-encoding, reserved characters are represented using special character sequences. The sets of reserved and unreserved characters and the circumstances under which certain reserved characters have special meaning have changed slightly with each revision of specifications that govern URIs and URI schemes.

RFC 3986 section 2.2 Reserved Characters (January 2005)
! # $ & ' ( ) * + , / : ; = ? @ [ ]
RFC 3986 section 2.3 Unreserved Characters (January 2005)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - . _ ~

Other characters in a URI must be percent-encoded.
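
The classification above can be sketched as a small lookup. This is only an illustration: the two sets are transcribed from the RFC 3986 tables shown here, while the function name `classify` and the category label "must-encode" are hypothetical names chosen for this example.

```python
import string

# Character sets transcribed from the RFC 3986 tables above.
RESERVED = set("!#$&'()*+,/:;=?@[]")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Return the URI category of a single character (illustrative labels)."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in RESERVED:
        return "reserved"
    if ch == "%":
        # Only legal as the first character of a percent-encoded triplet.
        return "percent"
    # Everything else (space, quotes, non-ASCII, ...) must be percent-encoded.
    return "must-encode"


for ch in ["A", "~", "/", "%", " ", "€"]:
    print(repr(ch), classify(ch))
```

A space and the euro sign both fall outside the two sets, so they land in the last category.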

Reserved characters


When a character from the reserved set (a "reserved character") has a special meaning (a "reserved purpose") in a certain context, and a URI scheme says that it is necessary to use that character for some other purpose, then the character must be percent-encoded. Percent-encoding a reserved character involves converting the character to its corresponding byte value in ASCII and then representing that value as a pair of hexadecimal digits (if there is a single hex digit, a leading zero is added). The digits, preceded by a percent sign (%) as an escape character, are then used in the URI in place of the reserved character. (A non-ASCII character is typically converted to its byte sequence in UTF-8, and then each byte value is represented as above.)

The reserved character /, for example, if used in the "path" component of a URI, has the special meaning of being a delimiter between path segments. If, according to a given URI scheme, / needs to be in a path segment, then the three characters %2F or %2f must be used in the segment instead of a raw /.

Reserved characters after percent-encoding
! # $ & ' ( ) * + , / : ; = ? @ [ ]
%21 %23 %24 %26 %27 %28 %29 %2A %2B %2C %2F %3A %3B %3D %3F %40 %5B %5D
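As a minimal sketch, the table above can be reproduced by taking each character's ASCII byte value and formatting it as two hexadecimal digits after a percent sign. The helper percent_encode_ascii is a hypothetical name used for illustration; urllib.parse.quote is the Python standard library function, called here with safe="" so that / is not left unencoded.

```python
from urllib.parse import quote

RESERVED = "!#$&'()*+,/:;=?@[]"


def percent_encode_ascii(ch: str) -> str:
    """Return the %XX form of one ASCII character (hypothetical helper)."""
    byte = ch.encode("ascii")[0]   # raises UnicodeEncodeError for non-ASCII input
    return "%{:02X}".format(byte)  # two uppercase hex digits, zero-padded


# Rebuild the second row of the table from the first.
table = " ".join(percent_encode_ascii(c) for c in RESERVED)
print(table)

# A "/" that is data inside a path segment, not a delimiter between segments.
segment = quote("a/b", safe="")
print(segment)
```

By default quote treats / as safe, which suits whole paths; passing safe="" is appropriate when encoding a single path segment or other component in which / has to be treated as data.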

Reserved characters that have no reserved purpose in a particular context may also be percent-encoded, but the encoded form is not semantically different from the literal character.

In the "query" component of a URI (the part after a ? character), for example, / is still considered a reserved character but it normally has no reserved purpose, unless a particular URI scheme says otherwise. The character does not need to be percent-encoded when it has no reserved purpose.

URIs that differ only by whether a reserved character is percent-encoded or appears literally are normally not considered equivalent (that is, denoting the same resource) unless it can be determined that the reserved characters in question have no reserved purpose. This determination depends on the rules established for reserved characters by individual URI schemes.

Unreserved characters

Characters from the unreserved set never need to be percent-encoded.

URIs that differ only by whether an unreserved character is percent-encoded or appears literally are equivalent by definition, but URI processors, in practice, may not always recognize this equivalence. For example, URI consumers should not treat %41 differently from A or %7E differently from ~, but some do. For maximal interoperability, URI producers are discouraged from percent-encoding unreserved characters.
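A consumer that wants to recognize this equivalence can decode percent-encoded unreserved characters before comparing URIs. The following is a rough sketch of that idea, not a complete URI normalizer; normalize_unreserved is a hypothetical helper name, and uppercasing the remaining escapes is an additional assumption made so that %2f and %2F compare equal.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def normalize_unreserved(uri: str) -> str:
    """Decode %XX escapes of unreserved characters; leave all others encoded."""

    def replace(match: "re.Match[str]") -> str:
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch                  # e.g. %41 -> A, %7E -> ~
        return match.group(0).upper()  # keep encoded, with uppercase hex digits

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)


print(normalize_unreserved("/%7Euser/%41%2fb"))
```

Reserved characters such as %2F are deliberately left encoded, because decoding them could change the meaning of the URI.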

Percent character

Because the percent character ( % ) serves to indicate percent-encoded octets, it must itself be percent-encoded as %25 to be used as data within a URI.
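A short illustration using Python's urllib.parse (the sample string is arbitrary): a literal percent sign becomes %25, and encoding an already encoded string a second time encodes the percent signs again, which is a common source of bugs.

```python
from urllib.parse import quote, unquote

encoded = quote("100% sure")  # "%" -> %25, space -> %20
decoded = unquote(encoded)    # round-trips to the original text
double = quote(encoded)       # encoding twice escapes the "%" characters again

print(encoded)
print(decoded)
print(double)
```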

Arbitrary data

Most URI schemes involve the representation of arbitrary data, such as an IP address or file system path, as components of a URI. URI scheme specifications should, but often do not, provide an explicit mapping between URI characters and all possible data values being represented by those characters.

Binary data

Since the publication of RFC 1738 in 1994 it has been specified that schemes that provide for the representation of binary data in a URI must divide the data into 8-bit bytes and percent-encode each byte in the same manner as above.[1] Byte value 0x0F, for example, should be represented by %0F, but byte value 0x41 can be represented by A, or %41. The use of unencoded characters for alphanumeric and other unreserved characters is typically preferred, as it results in shorter URLs.
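As an illustration, Python's urllib.parse offers byte-oriented helpers that behave this way. The three sample byte values are arbitrary; 0x41 is emitted as the unreserved character A rather than as %41.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

data = bytes([0x0F, 0x41, 0xFF])

encoded = quote_from_bytes(data)          # unreserved bytes stay literal
decoded = unquote_to_bytes("%0F%41%FF")   # %41 and A decode to the same byte

print(encoded)
print(decoded == data)
```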

Character data

The procedure for percent-encoding binary data has often been extrapolated, sometimes inappropriately or without being fully specified, to apply to character-based data. In the World Wide Web's formative years, when dealing with data characters in the ASCII repertoire and using their corresponding bytes in ASCII as the basis for determining percent-encoded sequences, this practice was relatively harmless; it was just assumed that characters and bytes mapped one-to-one and were interchangeable. The need to represent characters outside the ASCII range, however, grew quickly, and URI schemes and protocols often failed to provide standard rules for preparing character data for inclusion in a URI. Web applications consequently began using different multi-byte, stateful, and other non-ASCII-compatible encodings as the basis for percent-encoding, leading to ambiguities and difficulty interpreting URIs reliably.

For example, many URI schemes and protocols based on RFCs 1738 and 2396 presume that the data characters will be converted to bytes according to some unspecified character encoding before being represented in a URI by unreserved characters or percent-encoded bytes. If the scheme does not allow the URI to provide a hint as to what encoding was used, or if the encoding conflicts with the use of ASCII to percent-encode reserved and unreserved characters, then the URI cannot be reliably interpreted. Some schemes fail to account for encoding at all and instead just suggest that data characters map directly to URI characters, which leaves it up to implementations to decide whether and how to percent-encode data characters that are in neither the reserved nor unreserved sets.

Common characters after percent-encoding (ASCII or UTF-8 based)
␣ " % - . < > \ ^ _ ` { | } ~ £ €
%20 %22 %25 %2D %2E %3C %3E %5C %5E %5F %60 %7B %7C %7D %7E %C2%A3 %E2%82%AC
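The ambiguity described above is easy to reproduce. In this sketch, the same character is percent-encoded from two different character encodings using Python's urllib.parse; the choice of Latin-1 as the second encoding is only an example.

```python
from urllib.parse import quote, unquote

utf8_pound = quote("£")                       # UTF-8: two bytes, C2 A3
latin1_pound = quote("£", encoding="latin-1") # Latin-1: one byte, A3
utf8_euro = quote("€")                        # UTF-8: three bytes, E2 82 AC

print(utf8_pound, latin1_pound, utf8_euro)

# A consumer has to know (or guess) which encoding the producer used.
right_guess = unquote("%A3", encoding="latin-1")
wrong_guess = unquote("%A3")  # default UTF-8; invalid bytes become U+FFFD

print(right_guess, repr(wrong_guess))
```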

Arbitrary character data is sometimes percent-encoded and used in non-URI situations, such as for password-obfuscation programs or other system-specific translation protocols.

Current standard

The generic URI syntax recommends that new URI schemes that provide for the representation of character data in a URI should, in effect, represent characters from the unreserved set without translation and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This suggestion was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
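The recommended procedure can be written out by hand as a minimal sketch: unreserved characters pass through unchanged, and every other character is converted to its UTF-8 bytes, each of which is percent-encoded. The function name encode_component and the sample text are assumptions for illustration; in practice, urllib.parse.quote with safe="" gives the same result.

```python
import string
from urllib.parse import quote

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def encode_component(text: str) -> str:
    """Percent-encode text: keep unreserved characters, UTF-8 encode the rest."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # One %XX triplet per UTF-8 byte of the character.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


sample = "naïve café/€"
result = encode_component(sample)
print(result)
print(result == quote(sample, safe=""))
```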

Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary or character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.

Non-standard implementations

There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a UTF-16 code unit represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The 13th edition of ECMA-262 still includes an escape function that uses this syntax, alongside the encodeURI and encodeURIComponent functions, which apply UTF-8 encoding to a string and then percent-escape the resulting bytes.[2]
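For illustration only, the %uxxxx form can be sketched as follows. Both helper names are hypothetical, and the encoder is deliberately simplified: it leaves all ASCII characters untouched, which is not the exact rule used by the ECMAScript escape function. Characters outside the Basic Multilingual Plane occupy two UTF-16 code units (a surrogate pair) and therefore produce two %uxxxx sequences.

```python
import re


def encode_percent_u(text: str) -> str:
    """Encode non-ASCII characters as %uXXXX, one per UTF-16 code unit."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)  # simplification: ASCII is passed through as-is
            continue
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("%u{:04X}".format(int.from_bytes(units[i:i + 2], "big")))
    return "".join(out)


def decode_percent_u(text: str) -> str:
    """Decode runs of %uXXXX sequences back into characters."""

    def replace(match: "re.Match[str]") -> str:
        units = re.findall(r"%u([0-9A-Fa-f]{4})", match.group(0))
        raw = b"".join(int(u, 16).to_bytes(2, "big") for u in units)
        # Decoding a whole run at once lets surrogate pairs combine correctly.
        return raw.decode("utf-16-be", errors="replace")

    return re.sub(r"(?:%u[0-9A-Fa-f]{4})+", replace, text)


print(encode_percent_u("€"))
print(encode_percent_u("😀"))
print(decode_percent_u("price %u20AC5"))
```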

The application/x-www-form-urlencoded type

When data that has been entered into HTML forms is submitted, the form field names and values are encoded and sent to the server in an HTTP request message using method GET or POST, or, historically, via email.[3] The encoding used by default is based on an early version of the general URI percent-encoding rules,[4] with a number of modifications such as newline normalization and replacing spaces with + instead of %20. The media type of data encoded this way is application/x-www-form-urlencoded, and it is currently defined in the HTML and XForms specifications. In addition, the CGI specification contains rules for how web servers decode data of this type and make it available to applications.

When HTML form data is sent in an HTTP GET request, it is included in the query component of the request URI using the same syntax described above. When sent in an HTTP POST request or via email, the data is placed in the body of the message, and application/x-www-form-urlencoded is included in the message's Content-Type header.
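The difference between general percent-encoding and the form encoding can be seen with Python's urllib.parse. The field names and values below are made up for the example; urlencode, quote_plus, quote, and parse_qs are standard library functions.

```python
from urllib.parse import parse_qs, quote, quote_plus, urlencode

fields = {"name": "Ada Lovelace", "q": "a+b=c&d"}

# Form encoding: spaces become "+", and "+", "=", "&" inside values are escaped.
body = urlencode(fields)
print(body)

# The same space is encoded differently by the two conventions.
print(quote("a b"), quote_plus("a b"))

# Decoding reverses both the "+" substitution and the percent-escapes.
parsed = parse_qs(body)
print(parsed)
```

For a GET request the resulting string would be appended to the URI after a ? character; for a POST request it would be sent as the message body with the Content-Type header set to application/x-www-form-urlencoded.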

References

  1. ^ RFC 1738 §2.2; RFC 2396 §2.4; RFC 3986 §1.2.1, 2.1, 2.5.
  2. ^ "ECMAScript 2017 Language Specification (ECMA-262, 8th edition, June 2017)". Ecma International. Archived from the original on 2018-07-02. Retrieved 2018-06-20.
  3. ^ User-agent support for email based HTML form submission, using a 'mailto' URL as the form action, was proposed in RFC 1867 section 5.6, during the HTML 3.2 era. Various web browsers implemented it by invoking a separate email program or using their own rudimentary SMTP capabilities. Although sometimes unreliable, it was briefly popular as a simple way to transmit form data without involving a web server or CGI scripts.
  4. ^ Berners-Lee, T. (June 1994). "RFC 1630". IETF Tools. IETF. Archived from the original on 21 June 2016. Retrieved 29 June 2016.

The following specifications all discuss and define reserved characters, unreserved characters, and percent-encoding, in some form or other: