Talk:UTF-8
This is the talk page for discussing improvements to the UTF-8 article. This is not a forum for general discussion of the article's subject. |
Article policies
|
Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL |
Archives: Index, 1, 2, 3, 4, 5Auto-archiving period: 3 months |
This article has not yet been rated on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: | ||||||||||||||||||||||||||||||||||||||
Please add the quality rating to the {{WikiProject banner shell}} template instead of this project banner. See WP:PIQA for details.
Please add the quality rating to the {{WikiProject banner shell}} template instead of this project banner. See WP:PIQA for details.
Please add the quality rating to the {{WikiProject banner shell}} template instead of this project banner. See WP:PIQA for details.
|
5- and 6-byte encodings
UTF-8, as it stands, does not know 5- and 6-byte encodings, that’s a fact. Having this very important fact buried in the third paragraph after the table of “design of UTF-8 as originally proposed” is just misleading. I would even prefer a table with those encodings removed altogether, which would be still better than the current version. I agree it might be good to show them, but we need to be very clear there is an important caveat in there. I fail to see how a slight color in background could be confusing (maybe the single-cell in the 4-byte encodings? I do not insist on that one), I was more afraid of a (correct) reminder about accessibility than that.
If you dislike coloring, which device would you find acceptable? Maybe just a thicker line below the 4-byte row? Whatever, it just needs some distinction.
--Mormegil (talk) 17:20, 24 February 2013 (UTC)
- I don't agree that some distinction is required in the table. It is presented as a table illustrating the original design. That's a fact, as you say. If you're concerned that people won't get the message that encodings conforming to RFC 3629 limit the range, then move that proviso into the sentence introducing the table. Trying to indicate it graphically in the table will just muddy the idea behind the design, and will require more explanatory fine print. -- Elphion (talk) 20:57, 24 February 2013 (UTC)
- Basically agree with Elphion... AnonMoos (talk) 00:46, 25 February 2013 (UTC)
- Agree here too. It clearly states this is the *ORIGINAL* design. The reason the table is used is that the repeating pattern is far easier to see with 6 lines than with 4. It is immediately followed with a paragraph that gives the dates and standards where the design was truncated. I also think this is a far clearer to show the 6 rows and then state that the last two were removed, than to show 4 rows and then later show the 2 rows that were removed (and the 5/6 byte sequences are an important part of UTF-8 history so they must be here). (talk) 06:35, 25 February 2013 (UTC)
- I think the table should not include the obsolete 5/6 byte sequences, at all. Very misleading - it fooled me. — Preceding unsigned comment added by 90.225.89.28 (talk) 12:28, 2 June 2013 (UTC)
- The table shall not contain the 5-6 byte sequences. The article shall present first what UTF-8 is today, and then, as a separate section, describe what it was many years ago. It is very confusing to present the material in the order of chronological development. Keep in mind that many come here to take a quick reference for UTF-8 as it is today, and the history is not that important for them. In other words, the article shall present the material in "the most important things go first" order. bungalo (talk) 12:16, 7 September 2013 (UTC)
- There is still a lot of confusion among programmers, who think that UTF-8 can be as long as 6 bytes, and therefore it's "bad". Looking at this article, or "Joel on unicode" say, explains why. I blame you for this confusion. Many readers will look at the diagrams only, and don't bother to read the text. It is legitimate, and you should take them into consideration. That UTF-8 was once a 6-byte encoding is irrelevant for anything but historical curiosity. bungalo (talk) 12:42, 7 September 2013 (UTC)
- It's legitimate for a programmer to only look at the diagram? I'm not sure how that makes sense; if you understand the syllogism "UTF-8 can be as long as 6 bytes" -> "it's bad" you should understand enough about Unicode to understand that UTF-8 is not 6 bytes long. In any case, I don't know of any force that can stop a hypothetical programmer that dismisses a technology based on
readingskimminglooking at the diagrams in a Wikipedia article.--Prosfilaes (talk) 21:06, 7 September 2013 (UTC)
- It's legitimate for a programmer to only look at the diagram? I'm not sure how that makes sense; if you understand the syllogism "UTF-8 can be as long as 6 bytes" -> "it's bad" you should understand enough about Unicode to understand that UTF-8 is not 6 bytes long. In any case, I don't know of any force that can stop a hypothetical programmer that dismisses a technology based on
Ybungalobill -- if those people just have the patience to scroll down to the big table, then they can see things that should be avoided highlighted in bright red... AnonMoos (talk) 23:07, 7 September 2013 (UTC)
Keep the original table. The cutoff is in the middle of the 4-byte sequences, so I do not believe truncating the table between 4 and 5 byte sequences makes any sense. The longer sequences make the pattern used much more obvious.Spitzak (talk) 21:07, 8 September 2013 (UTC)
UTF-8 supports 5- and 6-byte values perfectly well - UNICODE doesn't use them, and thus UNICODE-in-UTF-8 is restricted to the more limited range. (to belabor a point) Encoding high-end UTF-8 beyond the UNICODE range is perfectly legitimate, just don't call it UNICODE - unless UNICODE itself has (in some probably near future) expanded beyond the range it's using today. (more belaboring) The 0x10FFFF is a UNICODE-specific constraint, not one of UTF-8.--— Preceding unsigned comment added by 70.112.90.192 (talk • contribs)
- Unicode = ISO 10646 or UCS. UTF = UCS Transformation Format. That is, what UTF-8 is designed to process doesn't use values above 0x10FFFF, and so 5- and 6-byte values are irrelevant. There's no anticipation of needing them; there's 1,000 years of space at the current rate of growth of Unicode, which is expected to trend downward.
- You can encode stuff beyond 0x10FFFF, but it's no longer a UCS Transformation Format. I'm not sure why you'd do this--hacking non-text data into a nominal text stream?--but it's a local hack, not something that has ever been useful nor something that is widely supported.--Prosfilaes (talk) 12:57, 28 February 2014 (UTC)
- No, what the UTF-8 encoding scheme was "designed to process" was the full 2^31 space. The UTF-8 standard transformation format uses it only for the Unicode codepoints, and a compliant UTF-8 decoder would report out-of-range values as errors. I think we make that abundantly clear in the article. But "1,000 years of space at the current rate of growth" reminds me of "640K ought to be enough for anybody". Whether we'll ever need to look for larger limits is a moot point. There's no particular reason to prohibit software from considering such sequences. And it's certainly not a good reason to obscure the history of the scheme. I think the article currently strikes the right balance between history and current practice. -- Elphion (talk) 18:26, 5 March 2014 (UTC)
- It is incoherent to say "the full 2^31 space" without the context that implies "the full 2^31 space of Unicode". So it's not "no"; and in fact, I would say the emphasis is wrong; they wanted to support Unicode/ISO 10646, no matter what its form, not the 2^31 space. There is good reason to stop software from considering such sequences; "if you find F5, reject it" is much safer then adding poorly-tested code to process it, just to reject it at later level, and discouraging ad-hoc extensions to standard protocols is its own good. libtiff has had security holes because it supported features that that nobody had noticed hadn't worked in years. Whether we'll ever need to look for larger limits is not a moot point; writing unneeded, possibly buggy code for a situation that may come up is not wise.
- If you want a copy of every book Harper Lee wrote, how many bookcases are you going to put up? Personally, I'm not going to put up multiple bookcases on the nigh-inconceivable chance that somehow dozens of new books are going to appear from her pen. We knew that memory was something people were going to use more of, but every single character anyone can think of encoding, including many that nobody cares about, fits on four Unicode planes, some 240,000 characters, with plenty of blank space.--Prosfilaes (talk) 03:43, 6 March 2014 (UTC)
- It is not incoherent: everybody (even you) knows what is meant. The scheme was designed when Unicode was expected to include 2^31 codepoints, and that is what the scheme was designed to cover. As for broken software, nothing you say will prevent it from being written. The only reasonable defense is to write and promote good software. Software that parses 5 and 6 byte sequences as well as unused 4 byte sequences is not necessarily bad software. In terms of safety, I would argue that well tested parsing routines that handle 5- and 6-bytes sequences are inherently safer than adding special case rejections at an early stage. It is certainly a more flexible approach. And the analogy with physical bookcases is not particularly apt; keeping code flexible adds only minimal overhead. And in any event, your opinion or mine about how software should go about handling out-of-range sequences is really beyond the scope of this article. It suffices that a compliant reader report the errors. -- Elphion (talk) 14:38, 6 March 2014 (UTC)
- It is incoherent outside that context, and once we explicitly add that context it changes things. What it was designed to process is ISO-10646; the fact that they planned for a lot larger space is a minor detail. In terms of safety, your saying that well-tested parsing routines that have <F5> => error are less safe then <F5>... => some number that has to be filtered away later? If you believe your opinion about this subject is beyond scope, then don't bring it up. The simple fact is that UTF-8 in the 21st century only supports four byte sequences, that no encoder or decoder in history has ever had reason to handle anything longer. Emphasis should be laid on what it is, not what it was.--Prosfilaes (talk) 23:23, 6 March 2014 (UTC)
- "You keep using that word. I do not think it means what you think it means." (:-) -- Elphion (talk) 15:40, 7 March 2014 (UTC)
- The original design did in fact aim to cover the full 2^31 space. Ken Thompson's proposal [1] states: "The proposed UCS transformation format encodes UCS values in the range [0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, 5, and 6 bytes." -- Elphion (talk) 16:08, 7 March 2014 (UTC)
- The original design did cover the then-full 2^31 space. But that's in the technical part of the document; the aim of UTF-8 is stated above:
- With the approval of ISO/IEC 10646 (Unicode) as an international standard and the anticipated wide spread use of this universal coded character set (UCS), it is necessary for historically ASCII based operating systems to devise ways to cope with representation and handling of the large number of characters that are possible to be encoded by this new standard.
- So, no, it did not aim to cover the full 2^31 space, it aimed to handle "the large number of characters that are possible to be encoded by this new standard."--Prosfilaes (talk) 22:28, 7 March 2014 (UTC)
- That is a weird interpretation of that sentence. That some characters are "possible to be encoded" does not say anything about what "could" be encoded by that method. −Woodstone (talk) 06:02, 8 March 2014 (UTC)
- I don't understand your response. "Could" and "possible" mean basically the same thing. I think that sentence is their goal, to cover the characters of Unicode, not the 2^31 space.--Prosfilaes (talk) 22:01, 8 March 2014 (UTC)
Hi, I just wanted to say that I was using this article for research, and I also found the table to be confusing. It isn't inherently wrong, but as-is it belongs in a History or Technical Background section, not at the top of Description which should reflect current standards and practice. If the table does stay, I think it should be updated to clarify current usage *within the table itself* with a note, color coding, etc. Perhaps we can unite around the general principle that tables/charts/diagrams should be self sufficient, and not rely on surrounding prose for critical clarifications. Proxyma (talk) 15:03, 6 July 2014 (UTC)
- No, there is no reason to have two very similar tables. In addition the pattern is much easier to see with the 5 & 6 byte lines. Furthermore, a table "reflecting current usage" would have to somehow stop in the *middle* of the 4-byte line. Including the entire 4-byte line is misleading. Nobody seems to have any idea how to do that. Please leave the table as-is. This has been discussed enough.Spitzak (talk) 02:38, 7 July 2014 (UTC)
- This discussion seems to be based on different opinions about what is easier and more straightforward, so it's hard for me to see how the case has been closed. I gave my feedback because as a new reader I experienced the confusion others warned about here, and I think it's important to focus on the semi-casual reader. Perhaps it's human irrationality, but when readers see a big chart at the top, they interpret it as authoritative, and wouldn't consider parsing the rest of the text to see if it's later contradicted. I agree that two similar charts may be overkill, but in that case we should remove the one which has been inaccurate for more than a decade. Proxyma (talk) 03:03, 7 July 2014 (UTC)
- It would be useful if you could describe how you were confused. The table is quite clear, showing the layout for codepoints U+0000 to U+7FFFFFFF. The accompanying text explains that the current standard uses the scheme for the initial portion up to U+10FFFF, which goes into the 4-byte area but does not exhaust it. This seems perfectly clear to me. Any table trying to show the "21-bit" space directly would not be nearly as clear; it would obscure the design of the encoding, and would require more verbiage to explain it. The one improvement I would suggest is that the reduction of the codespace to U+10FFFF might usefully come before the table, so that the reader understands immediately that the full scheme is not currently used by Unicode. --- Elphion (talk) 04:23, 7 July 2014 (UTC)
- Elphion, I think you and I basically agree. The only modification I'd make to your proposal is to suggest that the clarification of the codespace reduction be made within the table itself. As I said, I think tables/charts/graphs/etc should be self-contained with respect to essential information. The possible exception is a caption, but that's effectively part of what it's captioning. As for why I was confused, it was because the table didn't include such a clarification. I think sometimes it's difficult for those of us who edit an article to see it "with fresh eyes" like a new reader. When we look at the table, we're already aware of the content of the following prose because we've already read it. Proxyma (talk) 06:44, 8 July 2014 (UTC)
- There have been endless attempts to colorize the table and split line 4 to "clarify it". All results are obviously less clear and have been reverted. They hid the pattern (by splitting line 4) and they just had to add more text than is currently attached to explain what the colored portion did. Or they did not split line 4 but used 3 colors and added even more text than is currently attached. Face it, it is impossible. Stop trying. Only possible change may be to move some of the text before the table, but I think that is less clear than the current order of "original design:", table, "modified later by this rfc...". That at least is in chronological order.Spitzak (talk) 18:56, 8 July 2014 (UTC)
Moved from article
(Should the word deprecated be added here like this | They supersede the definitions given in the following deprecated and/or obsolete works: ? Cya2evernote (talk) 14:31, 11 February 2014 (UTC))
Noncharacters
- Incnis Mrsi made a change to state that surrogates and noncharacters may not be encoded in UTF-8, and I changed this to only surrogates as noncharacters can be legally represented in UTF-8. BIL then reverted my edit with the comment "Noncharacters, such as reverse byte-order-mark 0xFFFE, shall not be encoded, and software are allowed to remove or replace them in the same ways as for single surrogates". This is simply untrue, and I am pretty sure that nowhere in the Unicode Standard does it specify that noncharacters should be treated as illegal codepoints such as unpaired surrogates. In fact the Unicode Standard Corrigendum #9: Clarification About Noncharacters goes out of its way to explain that noncharacters are permissible for interchange, and that they are called noncharacters because "they are permanently prohibited from being assigned standard, interchangeable meanings, rather than that they are prohibited from occurring in Unicode strings which happen to be interchanged". I think it is clear that noncharacters can legitimately be exchanged in encoded text, and as they can be represented in UTF-8, the article should not claim that they cannot be represented in UTF-8. BabelStone (talk) 18:04, 5 March 2014 (UTC)
- The Unicode standard seems only concerned with making sure UTF-16 can be used. The noncharacters mentioned can be encoded in UTF-16 no problem. Only the surrogate halves cannot be encoded in UTF-16 so they are trying to fix this by declaring them magically illegal and pretending they don't happen. So there is a difference and user BIL is correct. (note that I think UTF-16 is seriously broken and should have provided a method of encoding a continuous range, just like UTF-8 can encode the range 0x80-0xff even thought those values are also 'surrogate halves'.Spitzak (talk) 05:38, 7 March 2014 (UTC)
Proposal of UTF-8 use lists
The article's introduction have an assertion that need citation:
- "UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications"
Is difficult to find a unique source for all these applications. The alternative is to start a Wikipedia-list for all surveys:
So, in the list, grouped as below, add tables with columns "name", "extent of compatibility", "have suport to UTF-8" and "use UTF-8 as default". Tables:
- Standards:
- Operating system specifications compatible with UTF-8 (ex. POSIX);
- Programming language specifications compatible with UTF-8 (ex. Python);
- Web protocols compatible with UTF-8 (ex. SOAP);
- ...
- Software:
- Operating systems compatible with UTF-8;
- Compilers compatible with UTF-8;
- Mobile APIs compatible with UTF-8;
- ... compatible with UTF-8;
--Krauss (talk) 11:17, 10 August 2014 (UTC)
- There's no real limit to these lists, and no clear definition. Is Unix v7 compatible with UTF-8 because you can store arbitrary non-ASCII bytes in filenames? A lot of Unix and Unix programs are high-bit safe. Python isn't especially compatible with UTF-8; it can input any number of character sets, and I believe its internal encoding is nonstandard. Likewise, a lot of programs can process UTF-8 as one character set among many.--Prosfilaes (talk) 21:12, 10 August 2014 (UTC)
- I think that there are two simple and objective criteria:
- a kind of "self-determination": the software express (ex. in the manual pages) that is UTF-8 compatible;
- a kind of confirmation: other sources confirm the UTF-8 compatibility.
- No more, no less... It is enough for the list objectives, for users, etc. See EXEMPLES below. --Krauss (talk) 00:51, 18 August 2014 (UTC)
- I think that there are two simple and objective criteria:
Examples
Draft illustrating the use of the two types of references, indicating "self-determination", and "confirming that it does".
- Python3:
- Source code is UTF-8 compatible. Self-determination ref-1 and ref-2. Independent sources: S1. "By default, Python source files are treated as encoded in UTF-8.", [Van Rossum, Guido, and Fred L. Drake Jr. Python tutorial. Centrum voor Wiskunde en Informatica, 1995]. S2. "In Python 3, all strings are sequences of Unicode characters". diveintopython3.net.
- Build-in functions are UTF8-compatible. Self-determination string — Common string operations. Independent sources: ...
- Support at the core language level: no.
- PHP5:
- Source code is UTF-8 compatible. ...
- SOME build-in functions are UTF8-compatible. see
mb_*
functions and PCRE... and str_replace() and some another ones. - not compatible, but accepts automatically UTF-8 source-code and incorpore compatible libraries like mb*, PCRE, etc.
- Support at the core language level: no. (see PHP6 history).
- MySQL: yes, have compatible modes. ...
- PostgreSQL: yes. have compatible modes. ...
- libXML2: use UTF-8 as default (support at the core level)...
- ...
— Preceding unsigned comment added by Krauss (talk • contribs) 00:51, 18 August 2014
- I don't think a list of software compatible with UTF-8 is useful. Eventually, all software that is used in any notable manner will be UTF-8 compatible. To do the job properly would require exhaustive mentions of versions and a definition of "compatible" (Lua is compatible with UTF-8 but has no support for it). Such a list is not really suitable here. Johnuniq (talk) 01:25, 18 August 2014 (UTC)
- Maybe UTF-8 usage is increasing but I don't think it is taking any lead. The heavily used languages C# and Java use UTF-16 as default and Windows does also. I don't think that will change in short term. --BIL (talk) 07:58, 18 August 2014 (UTC)
- Sure, but even Notepad can read and write UTF-8 these days, so it would feature on a list of software compatible with UTF-8. I can't resist spreading the good word: http://utf8everywhere.org/ Johnuniq (talk) 11:00, 18 August 2014 (UTC)
- Maybe UTF-8 usage is increasing but I don't think it is taking any lead. The heavily used languages C# and Java use UTF-16 as default and Windows does also. I don't think that will change in short term. --BIL (talk) 07:58, 18 August 2014 (UTC)
- Oppose - as Johnuniq says, this list will be huge and essentially useless. RossPatterson (talk) 10:41, 18 August 2014 (UTC)
- I have no idea what it means for Python 3 to not have "support at the core language level". It reads in and writes out UTF-8 and hides the details of the encoding of the Unicode support. I don't think this is a productive thing to add to the page.--Prosfilaes (talk) 22:00, 18 August 2014 (UTC)
- Oppose, per Johnuniq's explanation. Such a list would be too long, it would never be complete, and it would doubtfully be used for the intended purpose. — Dsimic (talk | contribs) 08:20, 22 August 2014 (UTC)
Next step ...
- To remove the assertion "UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications" of the article's introduction. Need citation, but, as demonstred, never will get one.
- ... Think about another kind of list, tractable and smaller, like "List of software that are UFT-8 FULL compatible"; that is, discuss here what is "full compatible" in nowadays. Examples: LibXML2 can be showed as "configured with UTF8 by default" and "full compatible"; PHP was looking for "full compatibility" and "Unicode integration" with PHP6, but abandoned the project.
--Krauss (talk) 09:35, 22 August 2014 (UTC)
A bit of searching found these:
https://developer.apple.com/library/mac/qa/qa1173/_index.html
https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text
http://wayland.freedesktop.org/docs/html/apa.html#protocol-spec-wl_shell_surface-request-set_title
Spitzak (talk) 00:51, 24 August 2014 (UTC)
Double error correction
[[:File:UnicodeGrow2010.png|thumb|360px|Graph indicating that UTF-8 (light blue) exceeded all other encodings of text on the Web in 2007, and that by 2010 it was nearing 50%.<ref name="MarkDavis2010"/> Given that some ASCII (red) pages represent UTF-8 as entities, it is more than half.<ref name="html4_w3c">]]
The legend says "This may include pages containing only ASCII but marked as UTF-8. It may also include pages in CP1252 and other encodings mistakenly marked as UTF-8, these are relying on the browser rendering the bytes in errors as the original character set"... but, it is not the original idea, we can not count something "mistakenly marked as UTF-8", even it exist. The point is that there are a lot of ASCII pages that have symbols that web-browser map to UTF-8.
PubMed Central, for example, have 3.1 MILLION Articles in ASCII but using real UTF-8 by entity encode. No one is a mistake.
The old text (see thumb here) have a note <ref name="html4_w3c"> is: { { "HTML 4.01 Specification, Section 5 - HTML Document Representation", W3C Recommendation 24 December 1999. Asserts "Occasional characters that fall outside this encoding may still be represented by character references. These always refer to the document character set, not the character encoding. (...) Character references are a character encoding-independent (...).". See also Unicode and HTML/Numeric character references.} }
This old text have also some confusion (!)... so I corrected to "Many ASCII (red) pages have also some ISO 10646 symbols representanted by entities,[ref] that are in the UTF-8 repertoire. That set of pages may be counted as UTF-8 pages."
--Krauss (talk) 22:45, 23 August 2014 (UTC)
- I reverted this as you seem to have failed to understand it.
- First, an Entity IS NOT UTF-8!!!!!!! They contain only ascii characters such as '&' and digits and ';'. They can be correctly inserted into files that are NOT UTF-8 encoded and are tagged with other encodings.
- Marking an ASCII file as UTF-8 is not a mistake. An ASCII file is valid UTF-8. However since it does not contain any multi-byte characters it is a bit misleading to say these files are actually "using" UTF-8.
- Marking CP1252 as UTF-8 is very common, especially when files are concatenated, and browsers recognize this due to encoding errors. This graph also shows these mis-identified files as UTF-8 but they are not really.
- Sorry about my initial confused text. Now we have another problem here, is about interpretation of W3C standards and statistics.
- 1. RFC 2047 (MIME Content-Transfer-Encoding) interpretarion used in the charset or enconding attributes of HTTP (content-type header with charset) and HTML4 (meta http-equiv): say what must be interpreted as "ASCII page" and what is a "UTF-8 page". Your assertion "an ASCII file is valid UTF-8" is a distortion of these considerations.
- 2. W3C standards, HTML4.1 (1999): say that you can add to an ASCII page some special symbols (ISO 10646 as expressed by the standard) by entities. Since before 2007, what all web-browser do, when typing special symbols, is replace the entity by an UTF-8 character (rendering the entity as its standard UTF-8 glyph).
- 3. Statistics: this kind of statistics report must use first the technical standard options and variations. These options have concrete consequences that can be relevant to the "counting web pages". The user mistakes may be a good statistical hypothesis testing, but you must first to prove that they exist and that they are relevant... In this case, you must to prove that the "user mistake" is more important than technical standard option. In an encyclopedia, we did not show unproven hypothesis, neither a irrelevant one.
- --Krauss (talk) 10:23, 24 August 2014 (UTC)
- An ASCII file is valid UTF-8. That's irrefutable fact. To speak of "its standard UTF-8 glyph" is a category error; UTF-8 doesn't have glyphs, as it's merely a mapping from bytes to Unicode code points.--Prosfilaes (talk) 21:23, 24 August 2014 (UTC)
- To elaborate on the second point above: Krauss in conflating "Unicode" and "UTF-8". They are not the same. A numerical character entity in HTML (e.g., ţ or ţ) is a way of representing a Unicode codepoint using only characters in the printable ASCII range. A browser finding such an entity will use the codepoint number from the entity to determine the Unicode character and will use its font repertoire to attempt to represent the character as a glyph. But this process does not involve UTF-8 encoding -- which is a different way of representing Unicode codepoints in the HTML byte stream. The ASCII characters of the entity might themselves be encoded in some other scheme: the entity in the stream might be ASCII characters or single-byte UTF-8 characters, or even UTF-16 characters, taking 2 bytes each. But the browser will decode them as ASCII characters first and then keying on the "&#...;" syntax use them to determine the codepoint number in a way that does not involve UTF-8. -- Elphion (talk) 21:58, 24 August 2014 (UTC)
- I agree the problem is that Krauss is confusing "Unicode" with "UTF-8". Sorry I did not figure that out earlier.Spitzak (talk) 23:28, 25 August 2014 (UTC)
- Our job as Wikipedia editors is not to interpret the standards, nor to determine what is and isn't appropriate to count as UTF-8 "usage". That job belongs to the people who write the various publications that we cite as references in our articles. Mark Davis's original post on the Official Google Blog, from whence this graph came and which we (now) correctly cite as its source, doesn't equivocate about the graph's content or meaning. Neither did his previous post on the topic. Davis is clearly a reliable source, even though the posts are on a blog, and we should not be second-guessing his claims. That job belongs to others (or to us, in other venues), and when counter-results are published, we should consider using them. RossPatterson (talk) 11:13, 25 August 2014 (UTC)
- Thanks for finding the original source. [2] clearly states that the graph is not just a count of the encoding id from the html header, but actually examines the text, and thus detects ASCII-only (I would assume also this detects UTF-8 when marked with other encodings, and other encodinds like CP1252 even if marked as UTF-8): "We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example... Note that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8)." The caption needs to be fixed up.Spitzak (talk) 23:28, 25 August 2014 (UTC)
- Krauss nicely points out below Erik van der Poel's methodology at the bottom of Karl Dubost's W3C blog post, which makes it explicit that the UTF-8 counts do not include ASCII: "Some documents come with charset labels declaring iso-8859-1, windows-1252 or even utf-8 when the byte values themselves are never greater than 127. Such documents are pure US-ASCII (if no ISO 2022 escape sequences are encountered).". RossPatterson (talk) 17:24, 27 August 2014 (UTC)
- Thanks for finding the original source. [2] clearly states that the graph is not just a count of the encoding id from the html header, but actually examines the text, and thus detects ASCII-only (I would assume also this detects UTF-8 when marked with other encodings, and other encodinds like CP1252 even if marked as UTF-8): "We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example... Note that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8)." The caption needs to be fixed up.Spitzak (talk) 23:28, 25 August 2014 (UTC)
Wow, a lot of discussion! So many intricate nuances of interpretations, sorry I was to imagine something more simple when started...
- "Unicod" vs "UTF8": Mark Davis use "Unicod (UTF8...)" in the legend, and later, in the text, express "As you can see, Unicode has (...)". So, for his public, "Unicode" and "UTF8" are near the same thing (only very specialized public fells pain with it). Here, in our discussion, is difficult to understand what the technical-level we must to use.
- About Mark Davis methodology, etc. no citation, only a vague "Every January, we look at the percentage of the webpages in our index that are in different encodings"...
But, SEE HERE similar discussion, by those who did the job (the data have been compiled by Erik van der Poel) - Trying an answer about glyph discussion: the Wikipedia glyph article is a little bit confuse (let's review!); see W3C use of the term. In a not-so-technical-jargon, or even in the W3C's "loose sense", we can say that there are a set of "standard glyphs/symbols" that are represented in a subset of "UTF-8-like symbols", and are not in ASCII neither CP1252 "symbols"... Regular people see that "ASCII≠CP1252" and "UTF8≠CP1252"... So, even regular people see that "ASCII≠UTF8" in the context of the illustration, and that HTML-entities are maped to something that is a subset of UTF8-like symbols.
Mark Davis not say any thing about HTML-entities or about "user mistakes", so, sugestion: let's remove it from article's text.
--Krauss (talk) 03:33, 26 August 2014 (UTC)
- Neither W3C page you point to says anything about UTF-8, and I don't have a clue where you're getting "UTF-8-like symbols" from. Unicode is the map from code points to symbols and all the associated material; UTF-8 is merely a mapping from bytes to code points. The fact that it can be confusing to some does not make it something we should conflate.--Prosfilaes (talk) 06:00, 27 August 2014 (UTC)
- My text only say "W3C use of the term" (the term "glyph" not the term "UTF-8"), and there (at the linked page) are a table with a "Glyph" column, with images showing the typical symbols. This W3C use of the term "glyph" as typical symbol, conflicts with the Wikipedia's thumb illustration with the text "various glyphs representing the typical symbol". Perhaps W3C is wrong, but since 2010 we need Refimprove (Wikipedia's glyph definition "needs additional citations for verification").
- About my bolded sugestion, "let's remove it", ok? need to wait or vote, or we can do it now? --Krauss (talk) 12:38, 27 August 2014 (UTC)
- I'm confused. Is Krauss questioning Mark Davis's reliability as a reference for this article? It seems to me that the graphs he presents are entirely appropriate to this article, especially after reading Erik van der Poel's methodology, as described in his 2008-05-08 post at the bottom of Karl Dubost's W3C blog post, which is designed to recognize UTF-8 specifically, not just Unicode in general. RossPatterson (talk) 17:16, 27 August 2014 (UTC)
- Sorry my english, I supposed Mark Davis and W3C as reliable sources (!). I think Mark Davis and W3C write some articles to the "big public" and other articles to the "specialized technical people"... We here can not confront "specialized text" with "loose text", even of the same author: this confrontation will obviously generates some "false evidence of contradiction" (see ex. "Unicode" vs "UTF8", "glyph" vs "symbol", etc. debates about correct use of the terms). About Erik van der Poel's explanations, well, this is other discussion, where I agree your first paragraph about it, "Our job as Wikipedia editors (...)". Now I whant only to check the sugestion ("let's remove it from article's text" above). --Krauss (talk) 11:13, 28 August 2014 (UTC)
It appears this discussion is moot - the graph image has been proposed for deletion 2 days from now. RossPatterson (talk) 03:41, 30 August 2014 (UTC)
- Thanks, fixed. --Krauss (talk) 17:25, 30 August 2014 (UTC)
Backward compatibility:
Re: One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0. This means that UTF-8 can be used for parsers expecting 8-bit extended ASCII even if they are not designed for UTF-8.
I'm a non-guru struggling with W3C's strong push to UTF-8 in a world of ISO-8859-1 and windows-1252 text editors, but either I have misunderstood this completely or else it is wrong? Seven-bit is the same in ASCII or UTF-8, sure; but in 8-bit extended ASCII (whether "extended" to ISO-8859-1, windows-1252 or whatever), a byte with the MSB "on" is one byte in extended ASCII, two bytes in UTF-8. A parser expecting "8-bit extended ASCII" will treat each of the UTF-8 bytes as a character. Result, misery. Or have I missed something? Wyresider (talk) 19:18, 5 December 2014 (UTC)
- No, it is not a problem unless your software decides to take two things that it thinks are "characters" and insert another byte in between them. In 99.999999% of the cases when reading the bytes in, the bytes with the high bit set will be output unchanged, still in order, and thus the UTF-8 is preserved. You might as well ask how programs handle english text when they don't have any concept of correct spelling and each word is a bunch of bytes that it looks at individually. How do the words get read and written when the program does not understand them? It is pretty obvious how it works, and this is why UTF-8 works too.Spitzak (talk) 19:52, 5 December 2014 (UTC)
- Wyresider -- This has been discussed in the article talk archives. Most of the time, if a program doesn't mess with what it doesn't understand, or treats sequences of high-bit-set characters as unanalyzable units, then simple filter etc. programs will often pass non-ASCII UTF8 characters through unaltered. It's a design feature of UTF8 which is designed to lighten the programming load of transition from single-byte to UTF8 -- though certainly not an absolute guarantee of backward compatibility... AnonMoos (talk) 14:51, 7 December 2014 (UTC)
- In a world of ISO-8859-1 and Windows-1252 text editors? What world is that? I live in a world where the most-spoken language is Chinese, which clears a billion users alone, and the text editors that come with any remotely recent version of Linux, Windows or Macs, or any version of Android or iOS, support UTF-8 (or at least Unicode). There's no magic button that makes UTF-8 work invariably with systems expecting 8-bit extended ASCII (or Windows-1252 with systems expecting 8-bit extended ASCII to not use C1 control codes 80-9F), but UTF-8 works better then, say, Big5 (which uses sub-128 values as part of multibyte characters) or ISO-2022-JP (which can use escape sequences to define sub-128 values to mean a character set other then ASCII).--Prosfilaes (talk) 13:45, 8 December 2014 (UTC)
- Wikipedia Talk pages are not a forum, but to be crystal clear, ASCII bytes have a high bit of zero and are UTF-8-clean, and anything that has a high bit of one isn't ASCII and will almost certainly have some bytes that will be treated differently in a UTF-8 context. A parser expecting data encoded in Windows codepage 1252 or in ISO 8859-1 isn't parsing ASCII, and won't understand UTF-8 correctly. RossPatterson (talk) 00:09, 9 December 2014 (UTC)
- There are many parsers that don't expect UTF-8 but work perfectly with it. An example is the printf "parser". The only sequence of bytes it will alter starts with an ascii '%' and contains only ascii (such as "%0.4f"). All other byte sequences are output unchanged. Therefore all multibyte UTF-8 characters are preserved. Another example is filenames, on Unix for instance the only bytes that mean anything are NUL and '/', all other bytes are considered part of the filename, and are not altered. Therefore all UTF-8 multibyte characters can be parts of filenames.Spitzak (talk) 02:24, 9 December 2014 (UTC)
- Thank you for the explanation. --MarkFilipak (talk) 20:10, 23 February 2015 (UTC)
- Thank you for the explanation. --MarkFilipak (talk) 20:11, 23 February 2015 (UTC)
Many errors
I'm not an expert here, but I am an engineer and I do recognize when I read something that's illogical.
There are 2 tables:
https://en.wikipedia.org/wiki/UTF-8#Description
https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
They cannot both be correct. If the one following #Description is correct, then the one following #Codepage_layout must be wrong.
Embellishing on the table that follows #Description:
1-byte scope: 0xxxxxxx = 7 bits = 128 code points.
2-byte scope: 110xxxxx 10xxxxxx = 11 bits = 2048 additional code points.
3-byte scope: 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = 65536 additional code points.
4-byte scope: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = 2097152 additional code points.
The article says: "The next 1,920 characters need two bytes to encode". As shown in the #Description table, 11 bits add 2048 code points, not 1920 code points. The mistake is in thinking that the 1-byte scope and the 2-byte scope overlap so that the 128 code points in the 1-byte scope must be deducted from the 2048 code points in the 2-byte scope. That's wrong. The two scopes do not overlap. They are completely independent of one another.
The text following the #Codepage_layout table says: "Orange cells with a large dot are continuation bytes. The hexadecimal number shown after a "+" plus sign is the value of the 6 bits they add." This implies there's a scope that looks like this:
2-byte scope: 01111111 10xxxxxx = 6 bits = 64 additional code points.
While that's possible, it conflicts with the #Description table. These discrepancies seem pretty serious to me. So serious that they put into doubt the entire article.
MarkFilipak (talk) 03:13, 23 February 2015 (UTC) )
- 11000001 10000000 encodes the same value as 01000000. So, yes, they do overlap.--Prosfilaes (talk) 03:24, 23 February 2015 (UTC)
- Huh? The scopes don't overlap. Perhaps you mean that they map to the same glyph? Are you sure? I don't know because I've not studied the subject, If this is a logical issue with me, it's probably a logical issue with others. Perhaps a section addressing this issue is appropriate, eh?
- Also, what about the "Orange cells" text and the 2-bit scope I've added to be consistent? That scope conflicts with the other table. Do you have a comment about that? Thank you. --MarkFilipak (talk) 03:53, 23 February 2015 (UTC)
- Perhaps this is what's needed. What do you think?
- 1-byte scope: 0xxxxxxx = 7 bits = 128 code points.
- 2-byte scope: 1100000x 10xxxxxx = 7 bits = 128 alias code points that map to the same points as 0xxxxxxx.
- 2-byte scope: 1100001x 10xxxxxx = 7 bits = 128 additional code points.
- 2-byte scope: 110001xx 10xxxxxx = 8 bits = 256 additional code points.
- 2-byte scope: 11001xxx 10xxxxxx = 9 bits = 512 additional code points.
- 2-byte scope: 1101xxxx 10xxxxxx = 10 bits = 1024 additional code points.
- 3-byte scope: 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = 65536 additional code points.
- 4-byte scope: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = 2097152 additional code points.
- --MarkFilipak (talk) 05:54, 23 February 2015 (UTC)
- For the first question, it seems you don't understand "code points" the way the article means. "Code points" here refer to Unicode code points. The unicode code points are better described in the Plane (Unicode) article. In the UTF-8 encoding the Unicode code points (in binary numbers) are directly mapped to the x:es in this table:
- 1-byte scope: 0xxxxxxx = 7 bits = 128 possible values.
- 2-byte scope: 110xxxxx 10xxxxxx = 11 bits = 2048 possible values.
- 3-byte scope: 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = 65536 possible values.
- 4-byte scope: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = 2097152 possible values.
- That means that in the 2-byte scheme you could encode the 2048 first code points, but you are not allowed to encode the first 128 code points, as described in the Overlong encodings section. And similarly it would be possible to encode all the 65536 first code points in the 3-byte scheme, but you are only allowed to use the 3-byte scheme from the 2049th code point. And the 4-byte scheme is used from the 66537th to the 1114112th (the last one) code point.
- For your second question, continuation bytes (starting with 10) are only allowed after start bytes (starting with 11), not after "ascii bytes" (starting with 0). The "ascii bytes" are only used in the 1-byte scope. Boivie (talk) 10:58, 23 February 2015 (UTC)
- Thanks for your reply. What you wrote is inconsistent with what Prosfilaes wrote, to wit: "11000001 10000000 encodes the same value as 01000000." Applying the principle of Overlong encodings, it seems to me that "11000001 10000000" is an overlong encoding (i.e., the encoding is required to be "01000000"), therefore, unless I misunderstand the principle of overlong encoding, what Prosfilaes wrote about "11000001 10000000" is wrong. I'll let you two work it out between you.
- It occurs to me that the "1100000x 10xxxxxx" scope could therefore be documented as follows:
- 2-byte scope: 1100000x 10xxxxxx = 7 bits = 128 illegal codings (see: Overlong encodings).
- Should it be so documented? Would that be helpful?
- Look, I don't want to be a pest, but this article seems inconsistent and to lack comprehensiveness. I have notions, not facts, so I can't "correct" the article. I invite all contributors who do have the facts to consider what I've written. I will continue to contribute critical commentary if encouraged to do so, but my lack of primary knowledge prohibits me from making direct edits on the source document. Regards --MarkFilipak (talk) 15:12, 23 February 2015 (UTC)
- I see nothing wrong with Prosfilaes' comment. Using the 2-bit scheme 11000001 10000000 would decode to code point 1000000, even if it would be the wrong way to encode it. I am also not sure it would be helpful to include illegal byte sequences in the first table under the Description header. It is clearly stated in the table from which code point to which code point each scheme should be used. The purpose of the table seem to be to show how to encode each code point, not to show how not to encode something. Boivie (talk) 17:16, 23 February 2015 (UTC)
- 1, Boivie, you wrote, "I see nothing wrong with Prosfilaes' comment." Assuming that you agree that "11000001 10000000" is overlong, and that overlong encodings "are not valid UTF-8 representations of the code point", then you must agree that "11000001 10000000" is invalid. How can what Prosfilaes wrote, "11000001 10000000 encodes the same value as 01000000," be correct if it's invalid? If it's invalid, then "11000001 10000000" doesn't encode any Unicode code point. Comment? --MarkFilipak (talk) 20:09, 23 February 2015 (UTC)
- 2, Regarding whether invalid encodings should be shown as invalid so as to clarify the issue in readers' minds, I ask: What's wrong with that? I assume you'd like to make the article as easily understood as possible. Comment? --MarkFilipak (talk) 20:09, 23 February 2015 (UTC)
- 3, Regarding the Orange cells quote: "Orange cells with a large dot are continuation bytes. The hexadecimal number shown after a "+" plus sign is the value of the 6 bits they add", it is vague and misleading because,
- 3.1, those cells don't add 6 bits, they cause a whole 2nd byte to be added, and
- 3.2, they don't actually add because the 1st byte ("0xxxxxxx") doesn't survive the addition -- it's completely replaced.
- Describing the transition from
- this: 00000000, 00000001, 00000010, ... 01111111, to
- this: 11000010 10000000, 11000010 10000001, 11000010 10000010, ... 11000010 10111111,
- as resulting from "the 6 bits they add" is (lacking the detail I supply in the 2 preceding sentences) going to confuse or mislead almost all readers. It mislead me. Now that I understand the process architecture, I can interpret "the 6 bits they add" as sort of a metaphorical statement, but there is a better way.
- My experience as an engineer and documentarian is to simply show the mapping from inputs to outputs (encodings to Unicode code points in this case) and trust that readers will see the patterns. Trying to explain the processes without showing the process is not the best way. I can supply a table that explicitly shows the mappings which you guys can approve or reject, but I need reassurance up front that what I produce will be considered. If not open to such consideration, I'll bid you adieu and move on to other aspects of my life. Comment? --MarkFilipak (talk) 20:09, 23 February 2015 (UTC)
- The tables are correct. The "scopes" as you call them do overlap. Every one of the lengths can encode a range of code points starting at zero, therefore 100% of the smaller length is overlapped. However UTF-8 definition further states that when there is this overlap, only the shorter version is valid. The longer version is called an "overlong encoding" and that sequence of bytes should be considered an error. So the 1-byte sequences can do 2^7 code points, or 128. The 2-byte sequences have 10 bits and thus appear to do 2^10 code points or 2048, but exactly 128 of these are overlong because there is a 1-byte version, thus leaving 2048-128 = 1920, just as the article says. In addition two of the lead bytes for 2-byte sequeces can *only* start an overlong encoding, so those bytes (C0,C1) can never appear in valid UTF-8 and thus are colored red in the byte table.Spitzak (talk) 20:02, 23 February 2015 (UTC)