Talk:UTF-8
{{Talk header|search=yes}}
{{WikiProject banner shell|class=B|vital=yes|1=
{{WikiProject Computing|importance=Mid}}
{{WikiProject Computer science|importance=mid}}
{{WikiProject Typography|importance=Mid}}
}}
{{User:MiszaBot/config
|maxarchivesize = 100K
|counter = 5
|algo = old(730d)
|archive = Talk:UTF-8/Archive %(counter)d
|mask=/Archive <#>
|leading_zeros=0
|indexhere=yes
}}
{{archive box|search=yes}}
== 5- and 6-byte encodings ==

UTF-8, as it stands, does not know 5- and 6-byte encodings, that's a fact. Having this ''very important'' fact buried in the third paragraph after the table of the "design of UTF-8 as originally proposed" is just misleading. I would even prefer a table with those encodings ''removed'' altogether, which would still be better than the current version. I agree it might be good to show them, but we need to be very clear that there is an important caveat there. I fail to see how a slight colour in the background could be confusing (maybe the single cell in the 4-byte encodings? I do not insist on that one); I was more afraid of a (correct) reminder about [[WP:ACCESSIBILITY|accessibility]] than of that.

If you dislike coloring, which device would you find acceptable? Maybe just a thicker line below the 4-byte row? Whatever it is, the table just needs ''some'' distinction.

--[[User:Mormegil|Mormegil]] ([[User talk:Mormegil|talk]]) 17:20, 24 February 2013 (UTC)

: I don't agree that some distinction is required in the table. It is presented as a table illustrating the original design. That's a fact, as you say. If you're concerned that people won't get the message that encodings conforming to RFC 3629 limit the range, then move that proviso into the sentence introducing the table. Trying to indicate it graphically in the table will just muddy the idea behind the design, and will require more explanatory fine print. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 20:57, 24 February 2013 (UTC)

::Basically agree with Elphion... [[User:AnonMoos|AnonMoos]] ([[User talk:AnonMoos|talk]]) 00:46, 25 February 2013 (UTC)

::Agree here too. It clearly states this is the *ORIGINAL* design. The reason the table is used is that the repeating pattern is far easier to see with 6 lines than with 4. It is immediately followed by a paragraph that gives the dates and standards by which the design was truncated. I also think it is far clearer to show the 6 rows and then state that the last two were removed than to show 4 rows and later show the 2 rows that were removed (and the 5/6-byte sequences are an important part of UTF-8 history, so they must be here). [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 06:35, 25 February 2013 (UTC)
::I think the table should not include the obsolete 5/6 byte sequences, at all. Very misleading - it fooled me. <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/90.225.89.28|90.225.89.28]] ([[User talk:90.225.89.28|talk]]) 12:28, 2 June 2013 (UTC)</span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot-->

::The table should not contain the 5-6 byte sequences. The article should first present what UTF-8 is ''today'', and then, as a separate section, describe what it was many years ago. It is very confusing to present the material in the order of chronological development. Keep in mind that many come here for a quick reference for UTF-8 as it is today, and the history is not that important for them. In other words, the article should present the material in "the most important things go first" order. [[User:Ybungalobill|bungalo]] ([[User talk:Ybungalobill|talk]]) 12:16, 7 September 2013 (UTC)

::There is still a lot of confusion among programmers, who think that UTF-8 can be as long as 6 bytes, and that it is therefore "bad". Looking at this article, or "Joel on Unicode", say, explains why. I blame you for this confusion. Many readers will look at the diagrams only, and not bother to read the text. That is legitimate, and you should take them into consideration. That UTF-8 was once a 6-byte encoding is irrelevant for anything but historical curiosity. [[User:Ybungalobill|bungalo]] ([[User talk:Ybungalobill|talk]]) 12:42, 7 September 2013 (UTC)

::: It's legitimate for a programmer to only look at the diagram? I'm not sure how that makes sense; if you understand the syllogism "UTF-8 can be as long as 6 bytes" -> "it's bad", you should understand enough about Unicode to understand that UTF-8 is not 6 bytes long. In any case, I don't know of any force that can stop a hypothetical programmer who dismisses a technology based on <s>reading</s> <s>skimming</s> looking at the diagrams in a Wikipedia article.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 21:06, 7 September 2013 (UTC)

Ybungalobill -- if those people just have the patience to scroll down to the big table, then they can see things that should be avoided highlighted in bright red... [[User:AnonMoos|AnonMoos]] ([[User talk:AnonMoos|talk]]) 23:07, 7 September 2013 (UTC)

Keep the original table. The cutoff is in the ''middle'' of the 4-byte sequences, so I do not believe truncating the table between the 4- and 5-byte sequences makes any sense. The longer sequences make the pattern used much more obvious. [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 21:07, 8 September 2013 (UTC)

UTF-8 supports 5- and 6-byte values perfectly well - UNICODE doesn't use them, and thus UNICODE-in-UTF-8 is restricted to the more limited range. (To belabor a point) encoding high-end UTF-8 beyond the UNICODE range is perfectly legitimate, just don't call it UNICODE - unless UNICODE itself has (in some probably near future) expanded beyond the range it's using today. (More belaboring) the 0x10FFFF limit is a UNICODE-specific constraint, not one of UTF-8. --<small><span class="autosigned">—&nbsp;Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:70.112.90.192|70.112.90.192]] ([[User talk:70.112.90.192|talk]] • [[Special:Contributions/70.112.90.192|contribs]]) </span></small><!-- Template:Unsigned -->


: Unicode = ISO 10646 or UCS. UTF = UCS Transformation Format. That is, what UTF-8 is designed to process doesn't use values above 0x10FFFF, and so 5- and 6-byte values are irrelevant. There's no anticipation of needing them; there's 1,000 years of space at the current rate of growth of Unicode, which is expected to trend downward.
: You can encode stuff beyond 0x10FFFF, but it's no longer a UCS Transformation Format. I'm not sure why you'd do this--hacking non-text data into a nominal text stream?--but it's a local hack, not something that has ever been useful nor something that is widely supported.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 12:57, 28 February 2014 (UTC)

:: No, what the UTF-8 encoding scheme was "designed to process" was the full 2^31 space. The UTF-8 standard transformation format uses it only for the Unicode codepoints, and a compliant UTF-8 decoder would report out-of-range values as errors. I think we make that abundantly clear in the article. But "1,000 years of space at the current rate of growth" reminds me of "640K ought to be enough for anybody". Whether we'll ever need to look for larger limits is a moot point. There's no particular reason to prohibit software from considering such sequences. And it's certainly not a good reason to obscure the history of the scheme. I think the article currently strikes the right balance between history and current practice. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 18:26, 5 March 2014 (UTC)

::: It is incoherent to say "the full 2^31 space" without the context that implies "the full 2^31 space of Unicode". So it's not "no"; and in fact, I would say the emphasis is wrong; they wanted to support Unicode/ISO 10646, no matter what its form, not the 2^31 space. There is good reason to stop software from considering such sequences; "if you find F5, reject it" is much safer than adding poorly tested code to process it, just to reject it at a later level, and discouraging ad-hoc extensions to standard protocols is its own good. libtiff has had security holes because it supported features that nobody had noticed hadn't worked in years. Whether we'll ever need to look for larger limits is not a moot point; writing unneeded, possibly buggy code for a situation that may never come up is not wise.
::: If you want a copy of every book [[Harper Lee]] wrote, how many bookcases are you going to put up? Personally, I'm not going to put up multiple bookcases on the nigh-inconceivable chance that somehow dozens of new books are going to appear from her pen. We knew that memory was something people were going to use more of, but every single character anyone can think of encoding, including many that nobody cares about, fits on four Unicode planes, some 240,000 characters, with plenty of blank space.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 03:43, 6 March 2014 (UTC)

:: It is not incoherent: everybody (even you) knows what is meant. The scheme was designed when Unicode was expected to include 2^31 codepoints, and that is what the scheme was designed to cover. As for broken software, nothing you say will prevent it from being written. The only reasonable defense is to write and promote good software. Software that parses 5- and 6-byte sequences as well as unused 4-byte sequences is not necessarily bad software. In terms of safety, I would argue that well-tested parsing routines that handle 5- and 6-byte sequences are inherently safer than adding special-case rejections at an early stage. It is certainly a more flexible approach. And the analogy with physical bookcases is not particularly apt; keeping code flexible adds only minimal overhead. And in any event, your opinion or mine about how software ''should'' go about handling out-of-range sequences is really beyond the scope of this article. It suffices that a compliant reader report the errors. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 14:38, 6 March 2014 (UTC)

::: It is incoherent outside that context, and once we explicitly add that context it changes things. What it was designed to process is ISO 10646; the fact that they planned for a much larger space is a minor detail. In terms of safety, you're saying that well-tested parsing routines that map <F5> => error are less safe than <F5>... => some number that has to be filtered away later? If you believe your opinion about this subject is beyond scope, then don't bring it up. The simple fact is that UTF-8 in the 21st century only supports four-byte sequences, and that no encoder or decoder in history has ever had reason to handle anything longer. Emphasis should be laid on what it is, not what it was.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 23:23, 6 March 2014 (UTC)

:::: "You keep using that word. I do not think it means what you think it means." (:-) -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 15:40, 7 March 2014 (UTC)

:::: The original design did in fact aim to cover the full 2^31 space. Ken Thompson's proposal [https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt] states: "The proposed UCS transformation format encodes UCS values in the range [0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, 5, and 6 bytes." -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 16:08, 7 March 2014 (UTC)


::::: The original design did cover the then-full 2^31 space. But that's in the technical part of the document; the aim of UTF-8 is stated above:

::::::With the approval of ISO/IEC 10646 (Unicode) as an international standard and the anticipated wide spread use of this universal coded character set (UCS), it is necessary for historically ASCII based operating systems to devise ways to cope with representation and handling of the large number of characters ''that are possible to be encoded by this new standard.''

::::: So, no, it did not aim to cover the full 2^31 space; it aimed to handle "the large number of characters that are possible to be encoded by this new standard."--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 22:28, 7 March 2014 (UTC)

::::::That is a weird interpretation of that sentence. That some characters are "possible to be encoded" does not say anything about what "could" be encoded by that method. &minus;[[User:Woodstone|Woodstone]] ([[User talk:Woodstone|talk]]) 06:02, 8 March 2014 (UTC)

::::::: I don't understand your response. "Could" and "possible" mean basically the same thing. I think that sentence is their goal: to cover the characters of Unicode, not the 2^31 space.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 22:01, 8 March 2014 (UTC)
Hi, I just wanted to say that I was using this article for research, and I also found the table confusing. It isn't inherently wrong, but as-is it belongs in a History or Technical Background section, not at the top of Description, which should reflect current standards and practice. If the table does stay, I think it should be updated to clarify current usage *within the table itself* with a note, color coding, etc. Perhaps we can unite around the general principle that tables/charts/diagrams should be self-sufficient, and not rely on surrounding prose for critical clarifications. [[User:Proxyma|Proxyma]] ([[User talk:Proxyma|talk]]) 15:03, 6 July 2014 (UTC)

:No, there is no reason to have two very similar tables. In addition, the pattern is much easier to see with the 5- and 6-byte lines. Furthermore, a table "reflecting current usage" would have to somehow stop in the *middle* of the 4-byte line; including the entire 4-byte line is misleading, and nobody seems to have any idea how to do that. Please leave the table as-is. This has been discussed enough. [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 02:38, 7 July 2014 (UTC)

::This discussion seems to be based on different opinions about what is easier and more straightforward, so it's hard for me to see how the case has been closed. I gave my feedback because as a new reader I experienced the confusion others warned about here, and I think it's important to focus on the semi-casual reader. Perhaps it's human irrationality, but when readers see a big chart at the top, they interpret it as authoritative, and wouldn't consider parsing the rest of the text to see if it's later contradicted. I agree that two similar charts may be overkill, but in that case we should remove the one which has been inaccurate for more than a decade. [[User:Proxyma|Proxyma]] ([[User talk:Proxyma|talk]]) 03:03, 7 July 2014 (UTC)

:::It would be useful if you could describe ''how'' you were confused. The table is quite clear, showing the layout for codepoints U+0000 to U+7FFFFFFF. The accompanying text explains that the current standard uses the scheme for the initial portion up to U+10FFFF, which goes into the 4-byte area but does not exhaust it. This seems perfectly clear to me. Any table trying to show the "21-bit" space directly would not be nearly as clear; it would obscure the design of the encoding, and would require more verbiage to explain it. The one improvement I would suggest is that the reduction of the codespace to U+10FFFF might usefully come ''before'' the table, so that the reader understands immediately that the full scheme is not currently used by Unicode. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 04:23, 7 July 2014 (UTC)

::::Elphion, I think you and I basically agree. The only modification I'd make to your proposal is to suggest that the clarification of the codespace reduction be made within the table itself. As I said, I think tables/charts/graphs/etc. should be self-contained with respect to essential information. The possible exception is a caption, but that's effectively part of what it's captioning. As for why I was confused, it was because the table didn't include such a clarification. I think sometimes it's difficult for those of us who edit an article to see it "with fresh eyes" like a new reader. When we look at the table, we're already aware of the content of the following prose because we've already read it. [[User:Proxyma|Proxyma]] ([[User talk:Proxyma|talk]]) 06:44, 8 July 2014 (UTC)

:::::There have been endless attempts to colorize the table and split line 4 to "clarify" it. All the results are obviously less clear and have been reverted. They hid the pattern (by splitting line 4) and had to add more text than is currently attached to explain what the colored portion did; or they did not split line 4 but used 3 colors and added even more text than is currently attached. Face it, it is impossible. Stop trying. The only possible change may be to move some of the text before the table, but I think that is less clear than the current order of "original design:", table, "modified later by this RFC...". That at least is in chronological order. [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 18:56, 8 July 2014 (UTC)
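For reference, a minimal sketch of the length rules debated in this section (Python; an editor's illustration, not text from the article). The 1-6-byte thresholds are those of Thompson's 1993 proposal quoted above; the U+10FFFF cap is RFC 3629's.

<syntaxhighlight lang="python">
def utf8_len_original(cp: int) -> int:
    """Sequence length under the original 1993 design (1-6 bytes, up to 2^31)."""
    if cp < 0:
        raise ValueError("negative code point")
    for length, limit in enumerate(
            (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000), start=1):
        if cp < limit:
            return length
    raise ValueError("beyond the original 2^31 design space")

def utf8_len_rfc3629(cp: int) -> int:
    """Sequence length under RFC 3629, which caps UTF-8 at U+10FFFF."""
    if cp > 0x10FFFF:
        raise ValueError("not encodable in UTF-8 as standardized today")
    return utf8_len_original(cp)  # only 1-4 bytes occur in this range

# The cutoff falls in the middle of the 4-byte row, as noted in the thread:
assert utf8_len_original(0x10FFFF) == utf8_len_original(0x1FFFFF) == 4
assert utf8_len_original(0x7FFFFFFF) == 6
</syntaxhighlight>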
==Moved from article==

(Should the word "deprecated" be added here, like this: "They supersede the definitions given in the following deprecated and/or obsolete works:"? [[User:Cya2evernote|Cya2evernote]] ([[User talk:Cya2evernote|talk]]) 14:31, 11 February 2014 (UTC))
== Table should not only use color to encode information (but formatting like bold and underline) ==

As in a previous comment ([https://en.wikipedia.org/wiki/Talk:UTF-8/Archive_1#Colour_in_example_table? Colour in example table?]), this has been done before, and is *better*, so that everyone can clearly see the different parts of the code.
Relying on color alone is not good, due to color vision deficiencies and varying color rendition on devices. <!-- Template:Unsigned --><small class="autosigned">—&nbsp;Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:88.219.179.109|88.219.179.109]] ([[User talk:88.219.179.109#top|talk]] • [[Special:Contributions/88.219.179.109|contribs]]) 02:26, 17 April 2020 (UTC)</small>

== Microsoft script dead link ==

The article says "and Microsoft has a script for Windows 10, to enable it by default for its program Microsoft Notepad", citing: "Script How to set default encoding to UTF-8 for notepad by PowerShell". gallery.technet.microsoft.com. Retrieved 2018-01-30. That is,

https://gallery.technet.microsoft.com/scriptcenter/How-to-set-default-2d9669ae?ranMID=24542&ranEAID=TnL5HPStwNw&ranSiteID=TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w&tduid=(1f29517b2ebdfe80772bf649d4c144b1)(256380)(2459594)(TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w)()

This link is dead. How to fix it? <!-- Template:Unsigned --><small class="autosigned">—&nbsp;Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Un1Gfn|Un1Gfn]] ([[User talk:Un1Gfn#top|talk]] • [[Special:Contributions/Un1Gfn|contribs]]) 02:58, 5 April 2021 (UTC)</small>

:That text, and that link, appear to have been removed, so there's no longer anything to fix. [[User:Guy Harris|Guy Harris]] ([[User talk:Guy Harris|talk]]) 23:43, 21 December 2023 (UTC)
== The article contains "{<nowiki/>{efn", which looks like a mistake. ==

I would've fixed it myself, but I don't know how to transform the remaining sentence to make sense. [[Special:Contributions/2A01:C23:8D8D:BF00:C070:85C1:B1B8:4094|2A01:C23:8D8D:BF00:C070:85C1:B1B8:4094]] ([[User talk:2A01:C23:8D8D:BF00:C070:85C1:B1B8:4094|talk]]) 16:17, 2 April 2024 (UTC)

:I fixed it, I think. I'm not 100% sure it's how the previous editors intended; I invite them to review and confirm. [[User:Indefatigable|Indefatigable]] ([[User talk:Indefatigable|talk]]) 19:03, 2 April 2024 (UTC)

== Should "The Manifesto" be mentioned somewhere? ==

More specifically, this one: https://utf8everywhere.org <!-- Template:Unsigned --><small class="autosigned">—&nbsp;Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Rudxain|Rudxain]] ([[User talk:Rudxain#top|talk]] • [[Special:Contributions/Rudxain|contribs]]) 21:52, 12 July 2024 (UTC)</small> <!--Autosigned by SineBot-->

:Only if it's got significant coverage in [[WP:reliable source|reliable source]]s. [[User:Remsense|<span style="border-radius:2px 0 0 2px;padding:3px;background:#1E816F;color:#fff">'''Remsense'''</span>]][[User talk:Remsense|<span lang="zh" style="border:1px solid #1E816F;border-radius:0 2px 2px 0;padding:1px 3px;color:#000"></span>]] 22:10, 12 July 2024 (UTC)

:It's kind of ahistorical, since the Microsoft decisions that they deplore were made while developing [[Windows NT 3.1]], and UTF-8 wasn't even a standard until Windows NT 3.1 was close to being released. There was more money to be made from East Asian customized computer systems than Unicode computer systems in 1993, so Unicode was probably not their main focus at that time... [[User:AnonMoos|AnonMoos]] ([[User talk:AnonMoos|talk]]) 20:30, 15 July 2024 (UTC)

== The number of 3 byte encodings is incorrect ==

This sentence is incorrect:

: Three bytes are needed for the remaining 61,440 codepoints...

FFFF - 0800 + 1 = F800 = 63,488 three-byte codepoints. The other calculations for the 1-, 2-, and 4-byte encodings are correct. [[User:Bantling66|Bantling66]] ([[User talk:Bantling66|talk]]) 02:56, 23 August 2024 (UTC)

:You forgot to subtract the 2,048 [[Universal Character Set characters#Surrogates|surrogates]] in the D800–DFFF range. – <i style="text-transform:lowercase">MwGamera</i> ([[User talk:MwGamera|talk]]) 08:58, 23 August 2024 (UTC)
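For readers checking the arithmetic, a quick verification sketch (Python; an editor's illustration, with the ranges of RFC 3629):

<syntaxhighlight lang="python">
# Counting code points per UTF-8 sequence length, per the thread above.
three_byte_range = 0xFFFF - 0x0800 + 1      # 63,488 code points in U+0800..U+FFFF
surrogates = 0xDFFF - 0xD800 + 1            # 2,048 surrogates, not encodable
assert three_byte_range == 63488
assert three_byte_range - surrogates == 61440   # the figure in the article

# The other lengths, for comparison:
assert 0x7F - 0x00 + 1 == 128                   # 1 byte:  U+0000..U+007F
assert 0x7FF - 0x80 + 1 == 1920                 # 2 bytes: U+0080..U+07FF
assert 0x10FFFF - 0x10000 + 1 == 1048576        # 4 bytes: U+10000..U+10FFFF
</syntaxhighlight>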
== Multi-point flags ==

I'm struggling to assume good faith here with [https://en.wikipedia.org/w/index.php?title=UTF-8&diff=prev&oldid=1246173477 this edit]. A flag which consists of five code points is already sufficiently illustrative of the issue being discussed. That an editor saw fit to first remove that example without discussion, and then to swap it out for the other example when it was pared down to one flag, invites discussion of why ''that particular flag'' was removed, and the obvious answer isn't a charitable one. [[User:Thumperward|Chris Cunningham (user:thumperward)]] ([[User talk:Thumperward|talk]]) 12:35, 17 September 2024 (UTC)

:Yes, it was restored to the pride flag for precisely the reasons you state. [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 20:48, 17 September 2024 (UTC)

::Better, more in-depth explanations of the flags can be found in the articles [[regional indicator symbol]] and [[Tags_(Unicode_block)#Current_use]] (the mechanism for these specific flags). I don't think this belongs in articles on specific [[character encoding]]s like UTF-8 at all.
::The fact that one [[code point]] does not necessarily produce one [[grapheme]] has nothing to do with a specific [[character encoding]] like [[UTF-8]]. It's a more fundamental property of the text itself, and any encoding that can encode some string of characters decodes back to the same characters from its binary representation. Although very popular, UTF-8 is just one of numerous ways to encode text to binary and back.
::I wrote more about this below at [https://en.wikipedia.org/wiki/Talk:UTF-8#Other_issues_in_the_article Other issues in the article] and sadly only then noticed this was already being somewhat discussed here. [[User:Mossymountain|Mossymountain]] ([[User talk:Mossymountain|talk]]) 10:45, 20 September 2024 (UTC)
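To make the code point/grapheme distinction concrete, a small sketch (Python; an editor's illustration — the EU flag built from two regional indicator symbols is used as a simpler stand-in for the tag-sequence flags under discussion):

<syntaxhighlight lang="python">
# One flag grapheme, two code points: regional indicators "E" + "U" (EU flag).
flag = "\U0001F1EA\U0001F1FA"
assert len(flag) == 2                  # two Unicode code points
assert len(flag.encode("utf-8")) == 8  # each encodes to 4 UTF-8 bytes

# The same two code points round-trip through any Unicode encoding,
# which is the encoding-independence point made above:
for codec in ("utf-8", "utf-16", "utf-32"):
    assert flag.encode(codec).decode(codec) == flag
</syntaxhighlight>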

== Noncharacters ==

: [[User:Incnis Mrsi|Incnis Mrsi]] made a [https://en.wikipedia.org/w/index.php?title=UTF-8&diff=598139607&oldid=598136720 change] to state that surrogates and noncharacters may not be encoded in UTF-8, and I changed this to only surrogates, as noncharacters can be legally represented in UTF-8. [[User:BIL|BIL]] then reverted my edit with the comment "Noncharacters, such as reverse byte-order-mark 0xFFFE, shall not be encoded, and software are allowed to remove or replace them in the same ways as for single surrogates". This is simply untrue, and I am pretty sure that nowhere in the Unicode Standard does it specify that noncharacters should be treated as illegal codepoints such as unpaired surrogates. In fact the Unicode Standard [http://www.unicode.org/versions/corrigendum9.html Corrigendum #9: Clarification About Noncharacters] goes out of its way to explain that noncharacters are permissible for interchange, and that they are called noncharacters because "they are permanently prohibited from being assigned standard, interchangeable meanings, rather than that they are prohibited from occurring in Unicode strings which happen to be interchanged". I think it is clear that noncharacters can legitimately be exchanged in encoded text, and as they can be represented in UTF-8, the article should not claim that they cannot be represented in UTF-8. [[User:BabelStone|BabelStone]] ([[User talk:BabelStone|talk]]) 18:04, 5 March 2014 (UTC)

: The Unicode standard seems only concerned with making sure UTF-16 can be used. The noncharacters mentioned can be encoded in UTF-16 no problem. Only the surrogate halves cannot be encoded in UTF-16, so they are trying to fix this by declaring them magically illegal and pretending they don't happen. So there is a difference, and user BIL is correct. (Note that I think UTF-16 is seriously broken and should have provided a method of encoding a continuous range, just like UTF-8 can encode the range 0x80-0xff even though those values are also 'surrogate halves'.) [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 05:38, 7 March 2014 (UTC)
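A small illustration of the distinction being debated (Python, whose UTF-8 codec follows RFC 3629; an editor's sketch): the noncharacter U+FFFE encodes to well-formed UTF-8, while an unpaired surrogate is rejected.

<syntaxhighlight lang="python">
# Noncharacters are structurally encodable in UTF-8:
assert "\uFFFE".encode("utf-8") == b"\xef\xbf\xbe"

# Unpaired surrogates are not; conformant encoders reject them:
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    pass  # surrogates D800..DFFF have no well-formed UTF-8 representation
</syntaxhighlight>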

== Proposal of UTF-8 use lists ==

The article's introduction has an [[Logical assertion|assertion]] that needs a citation:
:"UTF-8 is also increasingly being used as the default character encoding in [[operating systems]], [[programming languages]], [[application programming interface|APIs]], and [[application software|software applications]]"

It is difficult to find a single source for all these applications. An alternative is to start a Wikipedia list for all the surveys:

: ~ [[List of software that is UTF-8 compatible]]

So, in the list, grouped as below, add tables with the columns "name", "extent of compatibility", "has support for UTF-8" and "uses UTF-8 as default". Tables:

*Standards:
** [[Operating system]] specifications compatible with UTF-8 (ex. POSIX);
** [[Programming language]] specifications compatible with UTF-8 (ex. Python);
** Web protocols compatible with UTF-8 (ex. SOAP);
** ...

* Software:
** [[Operating systems]] compatible with UTF-8;
** [[Compiler]]s compatible with UTF-8;
** Mobile APIs compatible with UTF-8;
** ... compatible with UTF-8;

--[[User:Krauss|Krauss]] ([[User talk:Krauss|talk]]) 11:17, 10 August 2014 (UTC)

: There's no real limit to these lists, and no clear definition. Is Unix v7 compatible with UTF-8 because you can store arbitrary non-ASCII bytes in filenames? A lot of Unix and Unix programs are high-bit safe. Python isn't especially compatible with UTF-8; it can input any number of character sets, and I believe its internal encoding is nonstandard. Likewise, a lot of programs can process UTF-8 as one character set among many.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 21:12, 10 August 2014 (UTC)

:: I think that there are two simple and objective criteria:
::# a kind of "''self-determination''": the software states (e.g. in its manual pages) that it is UTF-8 compatible;
::# a kind of ''confirmation'': other sources confirm the UTF-8 compatibility.
::No more, no less... It is enough for the list's objectives, for users, etc. See the EXAMPLES below. --[[User:Krauss|Krauss]] ([[User talk:Krauss|talk]]) 00:51, 18 August 2014 (UTC)

=== Examples ===
A draft illustrating the use of the two types of references: "self-determination" and independent confirmation.
* Python3:
** ''Source code is UTF-8 compatible''. Self-determination: [http://legacy.python.org/dev/peps/pep-0263/ ref-1] and [https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding ref-2]. Independent sources: S1. "By default, Python source files are treated as encoded in UTF-8.", [Van Rossum, Guido, and Fred L. Drake Jr. [http://www.mindlabyrinth.ru/upload/iblock/36c/36cf608e09a426232173e0d041791ed6.pdf Python tutorial]. Centrum voor Wiskunde en Informatica, 1995]. S2. "In Python 3, all strings are sequences of Unicode characters". [http://www.diveintopython3.net/strings.html#divingin diveintopython3.net].
** ''Built-in functions are UTF-8 compatible''. Self-determination: [https://docs.python.org/3/library/string.html?highlight=strings string — Common string operations]. Independent sources: ...
** ''Support at the core language level'': no.

* PHP5:
** ''Source code is UTF-8 compatible''. ...
** ''SOME built-in functions are UTF-8 compatible''. See the <code>mb_*</code> functions and PCRE... and str_replace() and some other ones.
** Not compatible as such, but automatically accepts UTF-8 source code and incorporates compatible libraries like mb_*, PCRE, etc.
** ''Support at the core language level'': no. (See the PHP6 history.)

* MySQL: yes, has compatible modes. ...
* PostgreSQL: yes, has compatible modes. ...
* [[libXML2]]: uses UTF-8 by default (''support at the core level'')...
* ...
<small><span class="autosigned">—&nbsp;Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Krauss|Krauss]] ([[User talk:Krauss|talk]] • [[Special:Contributions/Krauss|contribs]]) 00:51, 18 August 2014</span></small><!-- Template:Unsigned -->
:I don't think a list of software compatible with UTF-8 is useful. Eventually, ''all'' software that is used in any notable manner will be UTF-8 compatible. To do the job properly would require exhaustive mentions of versions and a definition of "compatible" (Lua is compatible with UTF-8 but has no support for it). Such a list is not really suitable here. [[User:Johnuniq|Johnuniq]] ([[User talk:Johnuniq|talk]]) 01:25, 18 August 2014 (UTC)
::Maybe UTF-8 usage is increasing, but I don't think it is taking any lead. The heavily used languages C# and Java use UTF-16 as default, and Windows does also. I don't think that will change in the short term. --[[User:BIL|BIL]] ([[User talk:BIL|talk]]) 07:58, 18 August 2014 (UTC)
:::Sure, but even Notepad can read and write UTF-8 these days, so it would feature on a list of software compatible with UTF-8. I can't resist spreading the good word: http://utf8everywhere.org/ [[User:Johnuniq|Johnuniq]] ([[User talk:Johnuniq|talk]]) 11:00, 18 August 2014 (UTC)
:'''Oppose''' - as [[User:Johnuniq|Johnuniq]] says, this list will be huge and essentially useless. [[User:RossPatterson|RossPatterson]] ([[User talk:RossPatterson|talk]]) 10:41, 18 August 2014 (UTC)
: I have no idea what it means for Python 3 to not have "support at the core language level". It reads in and writes out UTF-8 and hides the details of the encoding of the Unicode support. I don't think this is a productive thing to add to the page.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 22:00, 18 August 2014 (UTC)
: '''Oppose''', per Johnuniq's explanation. Such a list would be too long, it would never be complete, and it would doubtfully be used for the intended purpose. &mdash;&nbsp;[[User:Dsimic|Dsimic]]&nbsp;([[User talk:Dsimic#nobold|talk]]&nbsp;|&nbsp;[[Special:Contributions/Dsimic|contribs]]) 08:20, 22 August 2014 (UTC)

=== Next step ... ===

# '''Remove''' the assertion "UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications" '''from the article's introduction'''. It needs a citation but, as demonstrated, will never get one.
# ... Think about another kind of list, tractable and smaller, like ''"List of software that is FULLY UTF-8 compatible"''; that is, '''discuss here what "fully compatible" means nowadays'''. Examples: [[LibXML2]] can be shown as "configured with UTF-8 by default" and "fully compatible"; PHP was looking for "full compatibility" and "Unicode integration" with [[PHP6#PHP_6_and_Unicode|PHP6]], but abandoned the project.

--[[User:Krauss|Krauss]] ([[User talk:Krauss|talk]]) 09:35, 22 August 2014 (UTC)

A bit of searching found these:

https://developer.apple.com/library/mac/qa/qa1173/_index.html

https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text

http://wayland.freedesktop.org/docs/html/apa.html#protocol-spec-wl_shell_surface-request-set_title

[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 00:51, 24 August 2014 (UTC)

== Double error correction ==
[[:File:UnicodeGrow2010.png|thumb|360px|Graph indicating that UTF-8 (light blue) exceeded all other encodings of text on the Web in 2007, and that by 2010 it was nearing 50%.&lt;ref name="MarkDavis2010"/&gt; Given that some ASCII (red) pages represent UTF-8 as [[Html_entity#Character_references|entities]], it is more than half.&lt;ref name="html4_w3c"&gt;]]

The legend says ''"This may include pages containing only ASCII but marked as UTF-8. It may also include pages in CP1252 and other encodings mistakenly marked as UTF-8, these are relying on the browser rendering the bytes in errors as the original character set"''... but that is not the original idea; we cannot count something "mistakenly marked as UTF-8", even if it exists. The point is that there are a lot of ASCII pages that have symbols that [[web-browser]]s map to UTF-8.

[[PubMed Central]], for example, has 3.1 MILLION articles in ASCII that use real UTF-8 via entity encoding. Not one of them is a mistake.

The old text (see the thumb '''here''') has a note &lt;ref name="html4_w3c"&gt;: { { "[http://www.w3.org/TR/html4/charset.html HTML 4.01 Specification, Section 5 - HTML Document Representation]", W3C Recommendation 24 December 1999. Asserts "Occasional characters that fall outside this encoding may still be represented by character references. These always refer to the document character set, not the character encoding. (...) Character references are a character encoding-independent (...).". See also [[Unicode and HTML#Numeric_character_references|Unicode and HTML/Numeric character references]].} }

This old text also had some confusion (!), so I corrected it to ''"Many ASCII (red) pages also have some [[Universal_Character_Set|ISO 10646]] symbols represented by [[Html_entity#Character_references|entities]],[ref] which are in the UTF-8 repertoire. That set of pages may be counted as UTF-8 pages."''

--[[User:Krauss|Krauss]] ([[User talk:Krauss|talk]]) 22:45, 23 August 2014 (UTC)

:I reverted this as you seem to have failed to understand it.

:First, an Entity IS NOT UTF-8!!!!!!! They contain only ascii characters such as '&' and digits and ';'. They can be correctly inserted into files that are NOT UTF-8 encoded and are tagged with other encodings.

:Marking an ASCII file as UTF-8 is not a mistake. An ASCII file is valid UTF-8. However since it does not contain any multi-byte characters it is a bit misleading to say these files are actually "using" UTF-8.

:Marking CP1252 as UTF-8 is very common, especially when files are concatenated, and browsers recognize this due to encoding errors. This graph also shows these mis-identified files as UTF-8 but they are not really.

:[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 23:58, 23 August 2014 (UTC)

:: Sorry about my initial confused text. Now we have another problem here; it is about the interpretation of W3C standards and statistics.
:: '''1. RFC 2047''' ([[MIME#Content-Type|MIME Content-Transfer-Encoding]]) '''interpretation''', used in the <tt>charset</tt> or <tt>encoding</tt> attributes of [[HTTP]] (content-type header with charset) and [[HTML4]] (meta http-equiv): this says <u>what must be interpreted as an ''"ASCII page"'' and what as a ''"UTF-8 page"''</u>. Your assertion "an ASCII file is valid UTF-8" is a distortion of these considerations.
:: '''2. W3C standards, HTML4.1 (1999)''': these say that you can add to an '''ASCII page''' some ''special symbols'' (''ISO 10646'', as the standard expresses it) by entities. Since before 2007, what all [[web-browser]]s do with special symbols is replace the entity by an UTF-8 character ([[Rendering (computer graphics)|rendering]] the entity as its standard UTF-8 [[glyph]]).
:: '''3. Statistics''': this kind of ''statistics report'' must first use the [[technical standard]] options and variations. These options have concrete consequences that can be relevant to counting web pages. The user mistakes may make a good [[statistical hypothesis testing|statistical hypothesis]], but you must first prove that they exist and that they are relevant... In this case, you must prove that the "user mistake" is more important than the ''technical standard option''. In an encyclopedia, we do not show an unproven hypothesis, nor an irrelevant one.
:: --[[User:Krauss|Krauss]] ([[User talk:Krauss|talk]]) 10:23, 24 August 2014 (UTC)

::: An ASCII file is valid UTF-8. That's irrefutable fact. To speak of "its standard UTF-8 glyph" is a category error; UTF-8 doesn't have glyphs, as it's merely a mapping from bytes to Unicode code points.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 21:23, 24 August 2014 (UTC)

::::To elaborate on the second point above: Krauss is conflating "Unicode" and "UTF-8". They are not the same. A numerical character entity in HTML (e.g., &#x0026;#355; or &#x0026;#x0163;) is a way of representing a Unicode codepoint using only characters in the printable ASCII range. A browser finding such an entity will use the codepoint number from the entity to determine the Unicode character and will use its font repertoire to attempt to represent the character as a glyph. But this process does not involve UTF-8 encoding -- which is a different way of representing Unicode codepoints in the HTML byte stream. The ASCII characters of the entity might themselves be encoded in some other scheme: the entity in the stream might be ASCII characters or single-byte UTF-8 characters, or even UTF-16 characters, taking 2 bytes each. But the browser will decode them as ASCII characters first and then, keying on the "&#...;" syntax, use them to determine the codepoint number in a way that does not involve UTF-8. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 21:58, 24 August 2014 (UTC)
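To make this point concrete, a sketch (Python; an editor's illustration, not part of the exchange): a numeric character reference is plain ASCII on the wire, while UTF-8 is a different byte-level representation of the same code point.

<syntaxhighlight lang="python">
import html

entity = "&#x163;"              # numeric character reference, pure ASCII
char = html.unescape(entity)    # resolves to U+0163
assert ord(char) == 0x0163

# The entity's own bytes are ASCII, whatever the document encoding claims:
assert entity.encode("ascii") == b"&#x163;"

# UTF-8 is a different representation of the same code point:
assert char.encode("utf-8") == b"\xc5\xa3"
</syntaxhighlight>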

::::I agree the problem is that Krauss is confusing "Unicode" with "UTF-8". Sorry I did not figure that out earlier.[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 23:28, 25 August 2014 (UTC)

:Our job as Wikipedia editors is not to interpret the standards, nor to determine what is and isn't appropriate to count as UTF-8 "usage". That job belongs to the people who write the various publications that we cite as references in our articles. [[Mark Davis (Unicode)|Mark Davis]]'s [http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html original post on the Official Google Blog], from whence this graph came and which we (now) correctly cite as its source, doesn't equivocate about the graph's content or meaning. Neither did [http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html his previous post] on the topic. Davis is clearly a [[WP:RS|reliable source]], even though the posts are on a blog, and we should not be second-guessing his claims. That job belongs to others (or to us, in other venues), and when counter-results are published, we should consider using them. [[User:RossPatterson|RossPatterson]] ([[User talk:RossPatterson|talk]]) 11:13, 25 August 2014 (UTC)

::Thanks for finding the original source. [http://googleblog.blogspot.ch/2012/02/unicode-over-60-percent-of-web.html] clearly states that the graph is not just a count of the encoding ID from the HTML header, but actually examines the text, and thus detects ASCII-only pages (I would assume this also detects UTF-8 when marked with other encodings, and other encodings like CP1252 even if marked as UTF-8): "We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example... Note that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8)." The caption needs to be fixed up. [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 23:28, 25 August 2014 (UTC)
::: Krauss nicely points out below [http://www.w3.org/QA/2008/05/utf8-web-growth Erik van der Poel's methodology at the bottom of Karl Dubost's W3C blog post], which makes it explicit that the UTF-8 counts do not include ASCII: "''Some documents come with charset labels declaring iso-8859-1, windows-1252 or even utf-8 when the byte values themselves are never greater than 127. Such documents are pure US-ASCII (if no ISO 2022 escape sequences are encountered).''". [[User:RossPatterson|RossPatterson]] ([[User talk:RossPatterson|talk]]) 17:24, 27 August 2014 (UTC)

Wow, a lot of discussion! So many intricate nuances of interpretation; sorry, I was imagining something simpler when I started...

* "Unicode" vs "UTF-8": Mark Davis uses "Unicode (UTF-8...)" in the legend and later, in the text, writes "As you can see, Unicode has (...)". So, for his public, "Unicode" and "UTF-8" are nearly the same thing (only a [http://www.artima.com/weblogs/viewpost.jsp?thread=230157 very specialized public feels pain with it]). Here, in our discussion, it is difficult to know what technical level we must use.
* About Mark Davis's methodology, etc.: no citation, only a vague "Every January, we look at the percentage of the webpages in our index that are in different encodings"... <br/>But '''[http://www.w3.org/QA/2008/05/utf8-web-growth SEE HERE a similar discussion, by those who did the job]''' (the data have been compiled by Erik van der Poel).
* Trying an answer to the ''glyph'' discussion: the Wikipedia [[glyph|glyph article]] is a little bit confused (let's review it!); see the [http://www.w3.org/Math/characters/html/symbol.html W3C use of the term]. In not-so-technical jargon, or even in the W3C's "loose sense", we can say that [http://dev.w3.org/html5/html-author/charref there is a set of "standard glyphs/symbols"] that are represented in a [[subset]] of "UTF-8-like symbols", and are not ASCII or CP1252 "symbols"... Regular people see that "ASCII&ne;CP1252" and "UTF8&ne;CP1252"... So, ''even regular people see that "ASCII&ne;UTF8" in the context of the illustration, and that HTML entities are mapped to something that is a subset of UTF-8-like symbols''.

Mark Davis does not say anything about HTML entities or about "user mistakes", so, '''suggestion''': let's remove them from the article's text.
<br/>--[[User:Krauss|Krauss]] ([[User talk:Krauss|talk]]) 03:33, 26 August 2014 (UTC)
: Neither W3C page you point to says anything about UTF-8, and I don't have a clue where you're getting "UTF-8-like symbols" from. Unicode is the map from code points to symbols and all the associated material; UTF-8 is merely a mapping from bytes to code points. The fact that it can be confusing to some does not make it something we should conflate.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 06:00, 27 August 2014 (UTC)
:: My text only says "W3C use of the term" (the term "glyph", not the term "UTF-8"), and there (at the linked page) is a table with a "Glyph" column, with images showing the ''typical symbols''. This W3C use of the term "glyph" as a typical symbol conflicts with the Wikipedia thumb [[:File:A-small glyphs.svg|illustration]] with the text "various glyphs representing the typical symbol". Perhaps W3C is wrong, but since 2010 we have needed Refimprove (Wikipedia's glyph definition "needs additional citations for verification").
:: About my bolded suggestion, "let's remove it": OK? Do we need to wait or vote, or can we do it now? --[[User:Krauss|Krauss]] ([[User talk:Krauss|talk]]) 12:38, 27 August 2014 (UTC)
:: I'm confused. Is Krauss questioning Mark Davis's reliability as a reference for this article? It seems to me that the graphs he presents are entirely appropriate to this article, especially after reading Erik van der Poel's methodology, as described in his 2008-05-08 post at the bottom of [http://www.w3.org/QA/2008/05/utf8-web-growth Karl Dubost's W3C blog post], which is designed to recognize UTF-8 specifically, not just Unicode in general. [[User:RossPatterson|RossPatterson]] ([[User talk:RossPatterson|talk]]) 17:16, 27 August 2014 (UTC)
::: Sorry for my English; I supposed Mark Davis and W3C to be reliable sources (!). I think Mark Davis and W3C write some articles for the "big public" and other articles for the "specialized technical people"... We here cannot confront a "specialized text" with a "loose text", even from the same author: this confrontation will obviously generate <u>some "false evidence of contradiction"</u> (see e.g. the "Unicode" ''vs'' "UTF-8" and "glyph" ''vs'' "symbol" debates about correct use of the terms). About Erik van der Poel's explanations, well, that is another discussion, where I agree with your first paragraph about it, "Our job as Wikipedia editors (...)". Now I want only to check the ''suggestion'' ("let's remove it from the article's text" above). --[[User:Krauss|Krauss]] ([[User talk:Krauss|talk]]) 11:13, 28 August 2014 (UTC)

It appears this discussion is moot - the graph image has been [https://en.wikipedia.org/w/index.php?title=File:UnicodeGrow2010.png&curid=43520324&diff=623323499&oldid=620614958 proposed for deletion 2 days from now]. [[User:RossPatterson|RossPatterson]] ([[User talk:RossPatterson|talk]]) 03:41, 30 August 2014 (UTC)
: Thanks, fixed. --[[User:Krauss|Krauss]] ([[User talk:Krauss|talk]]) 17:25, 30 August 2014 (UTC)

== Backward compatibility ==

Re: ''One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0. This means that UTF-8 can be used for parsers expecting 8-bit extended ASCII even if they are not designed for UTF-8.''

I'm a non-guru struggling with W3C's strong push to UTF-8 in a world of ISO-8859-1 and windows-1252 text editors, but either I have misunderstood this completely or else it is wrong? Seven-bit is the same in ASCII or UTF-8, sure; but in 8-bit extended ASCII (whether "extended" to ISO-8859-1, windows-1252 or whatever), a byte with the MSB "on" is '''one''' byte in extended ASCII, '''two''' bytes in UTF-8. A parser expecting "8-bit extended ASCII" will treat ''each'' of the UTF-8 bytes as a character. Result, misery. Or have I missed something?
[[User:Wyresider|Wyresider]] ([[User talk:Wyresider|talk]]) 19:18, 5 December 2014 (UTC)


:No, it is not a problem, unless your software decides to take two things that it thinks are "characters" and insert another byte in between them. In 99.999999% of cases, when reading the bytes in, the bytes with the high bit set will be output unchanged, still in order, and thus the UTF-8 is preserved. You might as well ask how programs handle English text when they don't have any concept of correct spelling and each word is a bunch of bytes that they look at individually. How do the words get read and written when the program does not understand them? It is pretty obvious how it works, and this is why UTF-8 works too. [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 19:52, 5 December 2014 (UTC)

:Wyresider -- This has been discussed in the article talk archives. Most of the time, if a program doesn't mess with what it doesn't understand, or treats sequences of high-bit-set characters as unanalyzable units, then simple filter etc. programs will often pass non-ASCII UTF-8 characters through unaltered. It's a design feature of UTF-8 intended to lighten the programming load of the transition from single-byte encodings to UTF-8 -- though certainly not an absolute guarantee of backward compatibility... [[User:AnonMoos|AnonMoos]] ([[User talk:AnonMoos|talk]]) 14:51, 7 December 2014 (UTC)

: In a world of ISO-8859-1 and Windows-1252 text editors? What world is that? I live in a world where the most-spoken language is Chinese, which clears a billion users alone, and the text editors that come with any remotely recent version of Linux, Windows or Macs, or any version of Android or iOS, support UTF-8 (or at least Unicode). There's no magic button that makes UTF-8 work invariably with systems expecting 8-bit extended ASCII (or Windows-1252 with systems expecting 8-bit extended ASCII not to use [[C0 and C1 control codes|C1 control codes 80-9F]]), but UTF-8 works better than, say, [[Big5]] (which uses sub-128 values as part of multibyte characters) or [[ISO-2022-JP]] (which can use escape sequences to redefine sub-128 values to mean a character set other than ASCII).--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 13:45, 8 December 2014 (UTC)

: [[WP:FORUM|Wikipedia Talk pages are not a forum]], but to be crystal clear: [[ASCII]] bytes have a high bit of zero and are UTF-8-clean, and anything that has a high bit of one isn't ASCII and will almost certainly have some bytes that will be treated differently in a UTF-8 context. A parser expecting data encoded in Windows codepage 1252 or in ISO 8859-1 isn't parsing ASCII, and won't understand UTF-8 correctly. [[User:RossPatterson|RossPatterson]] ([[User talk:RossPatterson|talk]]) 00:09, 9 December 2014 (UTC)
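A small sketch of the pass-through argument (Python; an editor's illustration): a byte-oriented filter that only inspects ASCII bytes preserves UTF-8 exactly, while decoding the same bytes as Latin-1 merely mislabels the display without destroying the bytes.

<syntaxhighlight lang="python">
text = "héllo\n".encode("utf-8")          # b'h\xc3\xa9llo\n'

# A byte-oriented "extended ASCII" filter: convert LF to CRLF. It touches
# only ASCII bytes, so the multibyte sequence C3 A9 passes through intact.
filtered = text.replace(b"\n", b"\r\n")
assert filtered == b"h\xc3\xa9llo\r\n"
assert filtered.replace(b"\r\n", b"\n").decode("utf-8") == "héllo\n"

# Decoding the same bytes as Latin-1 shows the display problem Wyresider
# describes: the two bytes of "é" appear as two characters, but the
# underlying byte sequence is not altered.
assert text.decode("latin-1") == "hÃ©llo\n"
</syntaxhighlight>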
:Wyresider -- This has been discussed in the article talk archives. Most of the time, if a program doesn't mess with what it doesn't understand, or treats sequences of high-bit-set characters as unanalyzable units, then simple filter etc. programs will often pass non-ASCII UTF8 characters through unaltered. It's a design feature of UTF8 which is designed to lighten the programming load of transition from single-byte to UTF8 -- though certainly not an absolute guarantee of backward compatibility... [[User:AnonMoos|AnonMoos]] ([[User talk:AnonMoos|talk]]) 14:51, 7 December 2014 (UTC)
: In a world of ISO-8859-1 and Windows-1252 text editors? What world is that? I live in a world where the most-spoken language is Chinese, which clears a billion users alone, and the text editors that come with any remotely recent version of Linux, Windows or Macs, or any version of Android or iOS, support UTF-8 (or at least Unicode). There's no magic button that makes UTF-8 work invariably with systems expecting 8-bit extended ASCII (or Windows-1252 with systems expecting 8-bit extended ASCII to not use [[C0 and C1 control codes|C1 control codes 80-9F]]), but UTF-8 works better than, say, [[Big5]] (which uses sub-128 values as part of multibyte characters) or [[ISO-2022-JP]] (which can use escape sequences to define sub-128 values to mean a character set other than ASCII).--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 13:45, 8 December 2014 (UTC)
: [[WP:FORUM|Wikipedia Talk pages are not a forum]], but to be crystal clear, [[ASCII]] bytes have a high bit of zero and are UTF-8-clean, and anything that has a high bit of one isn't ASCII and will almost certainly have some bytes that will be treated differently in a UTF-8 context. A parser expecting data encoded in Windows codepage 1252 or in ISO 8859-1 isn't parsing ASCII, and won't understand UTF-8 correctly. [[User:RossPatterson|RossPatterson]] ([[User talk:RossPatterson|talk]]) 00:09, 9 December 2014 (UTC)
: There are many parsers that don't expect UTF-8 but work perfectly with it. An example is the printf "parser". The only sequence of bytes it will alter starts with an ascii '%' and contains only ascii (such as "%0.4f"). All other byte sequences are output unchanged. Therefore all multibyte UTF-8 characters are preserved. Another example is filenames: on Unix, for instance, the only bytes that mean anything are NUL and '/'; all other bytes are considered part of the filename and are not altered. Therefore all UTF-8 multibyte characters can be parts of filenames. [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 02:24, 9 December 2014 (UTC)
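A minimal sketch of this pass-through property (an added illustration, not from the discussion; the helper name is made up): a byte-oriented tokenizer that assigns meaning only to the ASCII space byte leaves multi-byte UTF-8 sequences intact, because every byte of a multi-byte sequence is 0x80 or above and so never collides with an ASCII value.
<syntaxhighlight lang="python">
# A byte-oriented "parser" that knows nothing about UTF-8: it inspects only
# the ASCII space byte (0x20) and copies every other byte through unchanged.
def split_words(data: bytes) -> list[bytes]:
    return data.split(b' ')

text = "naïve café".encode('utf-8')          # contains multi-byte sequences
words = split_words(text)
assert b' '.join(words) == text              # the UTF-8 bytes survive intact
print([w.decode('utf-8') for w in words])    # ['naïve', 'café']
</syntaxhighlight>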
== Many errors ==
I'm not an expert here, but I am an engineer and I do recognize when I read something that's illogical.
There are 2 tables:
[[UTF-8#Description|https://en.wikipedia.org/wiki/UTF-8#Description]]
[[UTF-8#Codepage_layout|https://en.wikipedia.org/wiki/UTF-8#Codepage_layout]]
They cannot both be correct. If the one following #Description is correct, then the one following #Codepage_layout must be wrong.
Embellishing on the table that follows #Description:
1-byte scope: 0xxxxxxx = 7 bits = 128 code points.
2-byte scope: 110xxxxx 10xxxxxx = 11 bits = 2048 additional code points.
3-byte scope: 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = 65536 additional code points.
4-byte scope: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = 2097152 additional code points.
The article says: "The next 1,920 characters need two bytes to encode". As shown in the #Description table, 11 bits add 2048 code points, not 1920 code points. The mistake is in thinking that the 1-byte scope and the 2-byte scope overlap so that the 128 code points in the 1-byte scope must be deducted from the 2048 code points in the 2-byte scope. That's wrong. The two scopes do not overlap. They are completely independent of one another.
The text following the #Codepage_layout table says: "Orange cells with a large dot are continuation bytes. The hexadecimal number shown after a "+" plus sign is the value of the 6 bits they add." This implies there's a scope that looks like this:


2-byte scope: 01111111 10xxxxxx = 6 bits = 64 additional code points.
While that's possible, it conflicts with the #Description table. These discrepancies seem pretty serious to me. So serious that they put into doubt the entire article.
[[User:MarkFilipak|MarkFilipak]] ([[User talk:MarkFilipak|talk]]) 03:13, 23 February 2015 (UTC)
: 11000001 10000000 encodes the same value as 01000000. So, yes, they do overlap.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 03:24, 23 February 2015 (UTC)
:: '''Huh?''' The scopes don't overlap. Perhaps you mean that they map to the same glyph? Are you sure? I don't know because I've not studied the subject, If this is a logical issue with me, it's probably a logical issue with others. Perhaps a section addressing this issue is appropriate, eh?
:: Also, what about the "Orange cells" text and the 2-bit scope I've added to be consistent? That scope conflicts with the other table. Do you have a comment about that? Thank you. --[[User:MarkFilipak|MarkFilipak]] ([[User talk:MarkFilipak|talk]]) 03:53, 23 February 2015 (UTC)
:: '''Perhaps this is what's needed.''' What do you think?
:: 1-byte scope: 0xxxxxxx = 7 bits = 128 code points.
:: 2-byte scope: 1100000x 10xxxxxx = 7 bits = 128 alias code points that map to the same points as 0xxxxxxx.
:: 2-byte scope: 1100001x 10xxxxxx = 7 bits = 128 additional code points.
:: 2-byte scope: 110001xx 10xxxxxx = 8 bits = 256 additional code points.
:: 2-byte scope: 11001xxx 10xxxxxx = 9 bits = 512 additional code points.
:: 2-byte scope: 1101xxxx 10xxxxxx = 10 bits = 1024 additional code points.
:: 3-byte scope: 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = 65536 additional code points.
:: 4-byte scope: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = 2097152 additional code points.
:: --[[User:MarkFilipak|MarkFilipak]] ([[User talk:MarkFilipak|talk]]) 05:54, 23 February 2015 (UTC)
:For the first question, it seems you don't understand "code points" the way the article means. "Code points" here refer to [[Unicode]] code points. The unicode code points are better described in the [[Plane (Unicode)]] article. In the UTF-8 encoding the Unicode code points (in binary numbers) are directly mapped to the x:es in this table:
:1-byte scope: 0xxxxxxx = 7 bits = 128 possible values.
:2-byte scope: 110xxxxx 10xxxxxx = 11 bits = 2048 possible values.
:3-byte scope: 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = 65536 possible values.
:4-byte scope: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = 2097152 possible values.
:That means that in the 2-byte scheme you could encode the 2048 first code points, but you are not allowed to encode the first 128 code points, as described in the [[UTF-8#Overlong encodings|Overlong encodings]] section. And similarly it would be possible to encode all the 65536 first code points in the 3-byte scheme, but you are only allowed to use the 3-byte scheme from the 2049th code point. And the 4-byte scheme is used from the 65537th to the 1114112th (the last one) code point.
:For your second question, continuation bytes (starting with 10) are only allowed after start bytes (starting with 11), not after "ascii bytes" (starting with 0). The "ascii bytes" are only used in the 1-byte scope. [[User:Boivie|Boivie]] ([[User talk:Boivie|talk]]) 10:58, 23 February 2015 (UTC)
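A short sketch of that mapping (an added illustration based on the patterns above; the function name is made up, and surrogates are deliberately not special-cased):
<syntaxhighlight lang="python">
# Sketch: place a code point's bits into the UTF-8 byte patterns shown above.
# Picking the shortest applicable scope means no overlong form is produced.
def utf8_encode(cp: int) -> bytes:
    if cp < 0x80:                     # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                    # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,    # 11110xxx plus three continuation bytes
                  0x80 | cp >> 12 & 0x3F, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

assert utf8_encode(0x00A2) == '¢'.encode('utf-8') == b'\xc2\xa2'
</syntaxhighlight>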
:: Thanks for your reply. What you wrote is inconsistent with what Prosfilaes wrote, to wit: "11000001 10000000 encodes the same value as 01000000." Applying the principle of [[UTF-8#Overlong encodings|Overlong encodings]], it seems to me that "11000001 10000000" is an overlong encoding (i.e., the encoding is required to be "01000000"), therefore, unless I misunderstand the principle of overlong encoding, what Prosfilaes wrote about "11000001 10000000" is wrong. I'll let you two work it out between you.
:: It occurs to me that the "1100000x 10xxxxxx" scope could therefore be documented as follows:
:: 2-byte scope: 1100000x 10xxxxxx = 7 bits = 128 illegal codings (see: [[UTF-8#Overlong encodings|Overlong encodings]]).
:: ''Should'' it be so documented? Would that be helpful?
:: Look, I don't want to be a pest, but this article seems inconsistent and to lack comprehensiveness. I have notions, not facts, so I can't "correct" the article. I invite all contributors who ''do'' have the facts to consider what I've written. I will continue to contribute critical commentary if encouraged to do so, but my lack of primary knowledge prohibits me from making direct edits on the source document. Regards --[[User:MarkFilipak|MarkFilipak]] ([[User talk:MarkFilipak|talk]]) 15:12, 23 February 2015 (UTC)
: I see nothing wrong with Prosfilaes' comment. Using the 2-byte scheme 11000001 10000000 would decode to code point 1000000 (binary), even if it would be the wrong way to encode it. I am also not sure it would be helpful to include illegal byte sequences in the first table under the Description header. It is clearly stated in the table from which code point to which code point each scheme should be used. The purpose of the table seems to be to show how to encode each code point, not to show how not to encode something. [[User:Boivie|Boivie]] ([[User talk:Boivie|talk]]) 17:16, 23 February 2015 (UTC)
:::: 1, Boivie, you wrote, "I see nothing wrong with Prosfilaes' comment." Assuming that you agree that "11000001 10000000" is overlong, and that overlong encodings "are not valid UTF-8 representations of the code point", then you must agree that "11000001 10000000" is invalid. How can what Prosfilaes wrote, "11000001 10000000 encodes the same value as 01000000," be correct if it's invalid? If it's invalid, then "11000001 10000000" doesn't encode any Unicode code point. '''Comment?''' --[[User:MarkFilipak|MarkFilipak]] ([[User talk:MarkFilipak|talk]]) 20:09, 23 February 2015 (UTC)
:::: 2, Regarding whether invalid encodings should be shown as invalid so as to clarify the issue in readers' minds, I ask: What's wrong with that? I assume you'd like to make the article as easily understood as possible. '''Comment?''' --[[User:MarkFilipak|MarkFilipak]] ([[User talk:MarkFilipak|talk]]) 20:09, 23 February 2015 (UTC)
:::: 3, Regarding the Orange cells quote: "Orange cells with a large dot are continuation bytes. The hexadecimal number shown after a "+" plus sign is the value of the 6 bits they add", it is vague and misleading because,
::::: 3.1, those cells don't ''add'' 6 bits, they cause a whole 2nd byte to be ''added'', and
::::: 3.2, they don't actually ''add'' because the 1st byte ("0xxxxxxx") doesn't ''survive'' the addition -- it's completely replaced.
:::: Describing the transition from
:::: this: 00000000, 00000001, 00000010, ... 01111111, to
:::: this: 11000010 10000000, 11000010 10000001, 11000010 10000010, ... 11000010 10111111,
:::: as resulting from "the 6 bits they add" is (lacking the detail I supply in the 2 preceding sentences) going to confuse or mislead almost all readers. It misled me. Now that I understand the process architecture, I can interpret "the 6 bits they add" as sort of a metaphorical statement, but there is a better way.
:::: My experience as an engineer and documentarian is to simply show the mapping from inputs to outputs (encodings to Unicode code points in this case) and trust that readers will see the patterns. Trying to explain the processes without showing the process is not the best way. I can supply a table that explicitly shows the mappings which you guys can approve or reject, but I need reassurance up front that what I produce will be considered. If not open to such consideration, I'll bid you adieu and move on to other aspects of my life. '''Comment?''' --[[User:MarkFilipak|MarkFilipak]] ([[User talk:MarkFilipak|talk]]) 20:09, 23 February 2015 (UTC)
::::: ¢ U+00A2 is encoded as C2 A2, and if you look in square C2 you find 0080 and in square A2 you find +22. If you in hexadecimal add the continuation byte's +22 to the start byte's 0080 you get 00A2, which is the code point we started with. So the start byte gives the first bits, and the continuation byte gives the last six bits in the code point.
::::: I have no idea why a transition from the 1-byte scheme to the 2-byte scheme would be at all relevant in that figure. [[User:Boivie|Boivie]] ([[User talk:Boivie|talk]]) 21:02, 23 February 2015 (UTC)
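In code, that arithmetic can be checked directly (a sketch; the masks follow from the byte patterns in the table):
<syntaxhighlight lang="python">
b1, b2 = b'\xc2\xa2'           # UTF-8 for U+00A2 (¢)
start = (b1 & 0x1F) << 6       # bits carried by the start byte: 0x0080
cont = b2 & 0x3F               # the six bits the continuation byte adds: 0x22
assert start + cont == 0x00A2
</syntaxhighlight>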
:::::: Thank you for the explanation. --[[User:MarkFilipak|MarkFilipak]] ([[User talk:MarkFilipak|talk]]) 21:14, 23 February 2015 (UTC)
:::::: "Orange cells with a large dot are continuation bytes..." "White cells are the start bytes for a sequence of multiple bytes". Duh! You mean that the Orange cells aren't part of a "sequence of multiple bytes"? This article is awful and you guys just don't get it. I'm not going to waste my time arguing. I'm outta here. Bye. --[[User:MarkFilipak|MarkFilipak]] ([[User talk:MarkFilipak|talk]]) 21:25, 23 February 2015 (UTC)
:The tables are correct. The "scopes" as you call them do overlap. Every one of the lengths can encode a range of code points starting at zero, therefore 100% of the smaller length is overlapped. However the UTF-8 definition further states that when there is this overlap, only the shorter version is valid. The longer version is called an "overlong encoding" and that sequence of bytes should be considered an error. So the 1-byte sequences can do 2^7 code points, or 128. The 2-byte sequences have 11 bits and thus appear to do 2^11 code points or 2048, but exactly 128 of these are overlong because there is a 1-byte version, thus leaving 2048-128 = 1920, just as the article says. In addition two of the lead bytes for 2-byte sequences can *only* start an overlong encoding, so those bytes (C0,C1) can never appear in valid UTF-8 and thus are colored red in the byte table. [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 20:02, 23 February 2015 (UTC)
:: Thank you for the explanation. --[[User:MarkFilipak|MarkFilipak]] ([[User talk:MarkFilipak|talk]]) 20:13, 23 February 2015 (UTC)
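For what it's worth, stock decoders enforce the overlong rule described above; a quick check (Python used as an example):
<syntaxhighlight lang="python">
# C0 80 would be an overlong two-byte form of U+0000, so a conformant
# decoder must reject it; C0 and C1 can never start valid UTF-8.
try:
    b'\xc0\x80'.decode('utf-8')
except UnicodeDecodeError as err:
    print(err)   # "... can't decode byte 0xc0 ...: invalid start byte"
</syntaxhighlight>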

Latest revision as of 18:36, 21 December 2024

Microsoft script dead link
   and Microsoft has a script for Windows 10, to enable it by default for its program Microsoft Notepad
   "Script How to set default encoding to UTF-8 for notepad by PowerShell". gallery.technet.microsoft.com. Retrieved 2018-01-30.
   https://gallery.technet.microsoft.com/scriptcenter/How-to-set-default-2d9669ae?ranMID=24542&ranEAID=TnL5HPStwNw&ranSiteID=TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w&tduid=(1f29517b2ebdfe80772bf649d4c144b1)(256380)(2459594)(TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w)()

This link is dead. How to fix it? — Preceding unsigned comment added by Un1Gfn (talk • contribs) 02:58, 5 April 2021 (UTC)

That text, and that link, appears to have been removed, so there's no longer anything to fix. Guy Harris (talk) 23:43, 21 December 2023 (UTC)

The article contains "{{efn", which looks like a mistake.


I would've fixed it myself but I don't know how to transform the remaining sentence to make sense. 2A01:C23:8D8D:BF00:C070:85C1:B1B8:4094 (talk) 16:17, 2 April 2024 (UTC)

I fixed it, I think. I'm not 100% sure it's how the previous editors intended. I invite them to review and confirm. Indefatigable (talk) 19:03, 2 April 2024 (UTC)

Should "The Manifesto" be mentioned somewhere?


More specifically, this one: https://utf8everywhere.org — Preceding unsigned comment added by Rudxain (talk • contribs) 21:52, 12 July 2024 (UTC)

Only if it's got significant coverage in reliable sources. Remsense 22:10, 12 July 2024 (UTC)
It's kind of ahistorical, since the Microsoft decisions that they deplore were made while developing Windows NT 3.1, and UTF-8 wasn't even a standard until Windows NT 3.1 was close to being released. There was more money to be made from East Asian customized computer systems than Unicode computer systems in 1993, so Unicode was probably not their main focus at that time... AnonMoos (talk) 20:30, 15 July 2024 (UTC)

The number of 3 byte encodings is incorrect


This sentence is incorrect:

Three bytes are needed for the remaining 61,440 codepoints...

FFFF - 0800 + 1 = F800 = 63,488 three byte codepoints.

The other calculations for 1, 2, and 4 byte encodings are correct. Bantling66 (talk) 02:56, 23 August 2024 (UTC)

You forgot to subtract 2048 surrogates in the D800–DFFF range. – MwGamera (talk) 08:58, 23 August 2024 (UTC)
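The arithmetic, spelled out as a quick check (an added illustration, Python used as an example):

    three_byte_span = 0xFFFF - 0x0800 + 1   # 63,488 code points in U+0800..U+FFFF
    surrogates = 0xDFFF - 0xD800 + 1        # 2,048 values reserved for UTF-16
    assert three_byte_span - surrogates == 61_440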

Multi-point flags


I'm struggling to assume good faith here with this edit. A flag which consists of five code points is already sufficiently illustrative of the issue being discussed. That an editor saw fit to first remove that example without discussion, and then to swap it out for the other example when it was pared down to one flag, invites discussion of why that particular flag was removed, and the obvious answer isn't a charitable one. Chris Cunningham (user:thumperward) (talk) 12:35, 17 September 2024 (UTC)

Yes it was restored to the pride flag for precisely the reasons you state. Spitzak (talk) 20:48, 17 September 2024 (UTC)
Better, more in-depth explanations of the flags can be found in the articles regional indicator symbol and Tags_(Unicode_block)#Current_use (the mechanism for these specific flags). I don't think it belongs in articles about specific character encodings like UTF-8 at all.
The fact that one code point does not necessarily produce one grapheme has nothing to do with a specific character encoding like UTF-8. It's a more fundamental property of the text itself: any encoding that can encode a given string of characters yields back the same characters when decoded from the binary representation. Although very popular, UTF-8 is just one of numerous ways to encode text to binary and back.
I wrote more about this below at Other issues in the article and sadly only then noticed this was already being somewhat discussed here. Mossymountain (talk) 10:45, 20 September 2024 (UTC)
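A small sketch of that encoding-independence (an added illustration; the example flag is one grapheme built from two regional-indicator code points):

    # Grapheme clustering is a property of the text, not of the encoding.
    flag = "\U0001F1FA\U0001F1F8"     # U+1F1FA U+1F1F8, displayed as one flag
    assert len(flag) == 2             # two code points
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        assert flag.encode(enc).decode(enc) == flag   # same text either way
    assert len(flag.encode("utf-8")) == 8   # two 4-byte UTF-8 sequences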

Why was the "heart" of the article, almost the whole section of UTF-8#Encoding (Old revision) removed instead of adding a note?


NOTE: The section seems to have been renamed (UTF-8#Encoding -> UTF-8#Description) in this edit.

I don't understand why such a large part of UTF-8#Encoding (old revision) was suddenly removed in this edit (edit A), and then this edit (edit B) (diff after both edits) instead of either:

  • Adding a note about parts of it being written poorly.
  • Rewriting some of it. (the best and the most difficult option)
  • Carefully considering removing parts that were definitely redundant (such as arguably the latter part of UTF-8#Examples (old revision)).

Both of the edits removed a separate and quite well-written example (at least for my brain, these very examples made understanding UTF-8 require significantly less effort spent thinking). I don't think removing them was a good decision. Yes, you could explain basically anything without using examples, but in my experience an example is usually the easiest and fastest way for someone to understand almost any concept, especially when the examples were so visual and beautifully simple. I see it in the same category as a lecturer speaking with his hands and writing and drawing relevant things on a whiteboard versus having to hold the lecture by speaking over the phone.

The 1st, edit A

→‎Encoding: this entire section is almost completely opaque and its inclusion stymies the addition of some clear prose describing how unicode is decoded
— user:Thumperward, (edit A)

To me, this reads as if UTF-8 was accidentally conflated with Unicode, causing the parts to be removed from the wrong article (having thought about it more, I now think it's a severe disagreement of article design/presentation style).
(I still think edit notes asking for rewrites would have been the way to go instead of nuking the information, and that for some of the items, an article-like rewrite would be the wrong choice: Some data is way more enjoyable and simple to read visually from a table than it is to glean from written or spoken word and, as such, should be visualized in a table.)

I am strongly of the mind that the deleted parts included the two most important parts of the whole article, that must definitely be included as they are the very core of the article:

  1. The UTF-8#Codepage layout (old revision), in my opinion the most important part of any article about a character encoding. This part was in my opinion also designed, formatted and written exemplarily well here. The colour palette could be adjusted accordingly if it's a problem for the colour-blind.
    - Precedents/Examples in other articles about specific character encodings:
  2. The first list (numbered 1..7) of UTF-8#Examples (old revision) that clearly, by a singular simple example demonstrates how UTF-8 works. (I agree it could be rewritten, the language used is quite verbose)

Sweeping the less important items under these rugs to make this seem shorter:

The 2nd, edit B

→Encoding: this now refers to removed text and contradicts repeated assertions elsewhere that overlong encodings are unnecessary
— user:Thumperward, (edit B)

This edit removed the whole section UTF-8#Overlong encodings (old revision). I disagree with its removal.

  1. The example removed in this edit was a clear and easy to understand way of explaining what an overlong encoding means.
  2. I don't understand what the deleted text is said to have contradicted, unless this is something like the mention in UTF-8#Implementations and adoption of Java's "Modified UTF-8" that uses an overlong encoding for the null character. Overlong encodings aren't merely "unnecessary", they are *utterly forbidden*/invalid/illegal.
    • Apart from the lacking citation, which probably should have been rfc3629 § 3, I don't understand what was wrong with the second paragraph. I also consider the information presented in it essential for the article. (A simple decoder implementation could easily just pass the overlong encodings as if they were single-byte characters, or choose to simplify encoding by using a fixed length. The paragraph gives two good reasons why such encodings are illegal, that are now completely gone from the article.)
About removing helper colours and edit C

The 3rd, edit C


This is about font colouring on UTF-8#Encoding (old version), it reverts this edit by User:Nsmeds. The textual information stays the same between the two, the edit only removes the custom colours.

I would prefer some form of colouring to be added back.
Properly selected helper colours shouldn't be against anything:
I don't think {{colorblind|section}} or Wikipedia:Manual_of_Style/Accessibility#Color are at all suggesting the wiping of non-essential helper colours when they could only be potentially hard to distinguish from each other. What is definitely suggested instead is fixing situations with colouring that can make the text hard to read (colouring that can be assumed to potentially lead to a low contrast between the text and its background for any reader).

→Encoding: fix the colour blindness issue
— user:Thumperward, (edit C)

This is attempting to fix a potential issue for the colour-blind, but I think it unfortunately only ends up denying the help the colour was there to provide from both the colour-blind and not. The colours were NEVER the primary way to convey any data, but an additional help to make the parsing of the information faster and less straining to the eye (removing the need to count anything: you don't need to know that a hex digit covers 4 bits, or that the 0x7 on the left column corresponds to the first xxx on the right, and whether you do or don't, you just instantly see the relationship without thinking). This is obviously highly desirable in data visualization.

Even without doing anything to suboptimal colours, when they are only potentially hard to distinguish from each other instead of the background, the remaining distinguishable groups still serve the original purpose, only with some of it missing or hard to see. The monochrome version ends up being strictly worse.

Another way is to replace the straight x's with different symbols and have the key indicated on the ranges somehow, a mock-up:
U+0080(xyz) .. U+07FF(xyz) | 110xxxyy 10yyzzzz (hex digit resolution)
U+0080(xyy) .. U+07FF(xyy) | 110xxxyy 10yyyyyy (byte resolution) and this can be in addition to colouring that doesn't sacrifice contrast for anyone.


I just tried something like that in these edits. It's not ideal, especially how it makes the sentence before it quite unpleasant to read.

I think these should be considered before removing colour outright:

  • Do the colours used here even have a problem with contrast with the background, (or only amongst themselves and they are not providing information)? Maybe it's just that we should avoid the potential low-contrast combinations even for those with normal vision, such as:
    • Overly bright colours, such as bright yellow (after switching to light background, I really struggle to read "bright yellow" there)
    • Overly dark colours, such as deep blue (after switching to dark background, I struggle to read "deep blue" there)
    • Colours close to even the rest of the corresponding brightnesses between the light and dark mode background and their respective overlay backgrounds like this one of <code>

I think the least total effort catch-all long-term solution would be to provide a site-wide toggle on the side that overrides all text and background colouring when you want, probably makes sense beside the existing "Light" and "Dark" mode toggles, to force foreground elements close to the opposite end.

[Image: Three sequential colormaps that have been designed to be accessible to the color blind]

The other solution to fix all of what edit C attempted to fix (and the solution applicable right here and now) would be to use a palette that is also readable for the colour blind, such as the three palettes found at Color_blindness#Ordered_Information, which can be used to produce distinct colours that work regardless of colour-blindness.

NOTE: They ALL work for ALL types of colour blindness, it's just a choice of which one looks the nicest.
Do keep in mind however that all of the selected colours still need to have good contrast from both light and dark backgrounds, so maybe the colours from the very edges of these aren't usable, like how I attempted to demonstrate above with blue and yellow.

Other issues in the article (solved)

The UTF-8 article does talk about generic things about Unicode quite a bit more than I think it should, such as explaining how some "graphical characters can be more than 4 bytes in UTF-8". This is because Unicode (and by extension UTF-8) does not deal in graphemes in the first place, but code points (essentially just numbers to index into Unicode), which can correspond to valid Unicode characters, which in turn can directly correspond to a grapheme. Some characters don't correspond to a grapheme at all (control characters), such as the formatting tag characters used in the flag example, and some combine/join with other character(s) to produce a combination grapheme (combining/joining characters).
The possibility of needing to use multiple code points for one grapheme like that is a direct consequence of these types of characters in general and isn't caused by UTF-8 or any other encoding, and can happen through ANY and all encodings capable of encoding such code points, not just UTF-8.
In short: The issue has nothing to do with UTF-8.

Mossymountain (talk) 05:09, 20 September 2024 (UTC) Mossymountain (talk) 17:10, 20 September 2024 (UTC)

Because the editor was offended that that section used color. Akeosnhaoe (talk) 08:56, 20 September 2024 (UTC)
It's pretty important that we not communicate information solely through color, but I wonder how we could better do something like that. Remsense ‥ 09:02, 20 September 2024 (UTC)
Most of the information wasn't in the color, it was in the text readable without formatting in monochrome. The color was there just to make it easier to quickly identify which is which.
If what Akeosnhaoe said is the case (which I don't think it is, I think this was an honest misunderstanding with good intentions), obviously the colors should be changed to the intended visibility standard, not the information removed. Mossymountain (talk) 10:17, 20 September 2024 (UTC)

IMHO the edits made by user:thumperward were a good and powerful attempt to remove the obscene bloat of this article. The enormous, complex "examples" with color did not provide any information, and it is quite impossible to figure out what the colors mean without already knowing how UTF-8 works. Elimination of the "code page" is IMHO a good and daring decision, one I may not have made, and I'm glad he tried it. I'd like to continue, pretty much removing the bloated mess of "comparisons" that are either obvious or that nobody cares about; the few useful bits of info there can be merged into the description. Spitzak (talk) 18:13, 20 September 2024 (UTC)

My most important point, by far, is that I vehemently disagree with the removal of the code page.
It is the single thing with the most useful information packed on the article and irreplaceable in utility. I don't understand what was wrong with it at all. I see its removal as the same kind of hindrance as deleting all of the drawings that visualize what measurements the letters h, r, d represent on a cylinder from that article.
This makes it a lecture where the professor can only attend by talking over the phone. No gestures, no diagrams, nothing. It "does still work", it just requires more effort from the students (and from the professor, but that's a one-time cost here).
Yes, you technically still can glean all of the same information by reading through the article and spending effort to understand what you read, but it would outright DENY the use case where one just looks at a picture or two for a couple of seconds and is already able to close the article, while hindering the rest of the readers by not providing the still useful clarification as study aids.
I'm firmly in the camp that believes that for virtually all human readers, some well thought out visualizations illustrating some concept's defining characteristics only help in understanding, they are the best way to essentially convey "what something looks like", be it logically (like in this case) or physically. I personally have visited the UTF-8 page specifically for the code page for years whenever I needed a refresher when dealing with the encoding. Sure, I could have dug up a cumbersome specification and ^F'd through it to achieve the same thing in at least double the time, but the article was easily the best resource I've found on the internet for understanding UTF-8, largely thanks to how well the code page was thought out and put together.
I have only read some of the other text on the article previously, never in full before and I agree the article has had problems with bloat. In my mind this still does not mean the most useful thing should be removed in favour of briefness (it's essentially just a picture/diagram, but one that you can interact with to get more out of. The readers can easily identify that rough class of thing and skip it when they don't want to inspect it. It's very obviously not part of the text you're supposed to read out loud for example.) Mossymountain (talk) 05:36, 21 September 2024 (UTC)

I'm not relitigating basic, universally-understood concepts such as "articles should not be hundreds of kilobytes long", "articles should not use colours to convey important information" or "articles are not supposed to be reference textbooks". These are simply settled consensus. The code page table is absolutely useless for any purpose other than implementing handling of the format, which is categorically not the point of an encyclopedia article. What this article should do is explain where UTF-8 fits into the world, how it has been adopted, and how at some basic level it works. Precisely what any given sequence of bytes happens to stand for (other than in explaining how the byte sequence informs multi-byte code points) is not pertinent, especially because the lowest 128 byte values were very deliberately copied from ASCII anyway.

Frankly, the major thing I gleaned from the above wall of text (and that on my talk page) is that the editor posting it hasn't actually read the article very closely. A lot of the trimming down that was performed on the text was precisely because the article should put more emphasis on UTF-8's unique features, primarily its variable-length encoding and how multiple code points can be combined into a single glyph. I argued against the (seemingly political) removal of some of that detail in the previous section of this talk page, so it makes no sense to argue that this has somehow been de-emphasised by the removal of unrelated trivia.

This article still needs a lot of work. What it does not need is the re-addition of huge, heavy blocks of content of absolutely no value outside of a reference textbook. Chris Cunningham (user:thumperward) (talk) 11:03, 21 September 2024 (UTC)

I am not arguing those points. At least I don't think I am. The closest one is probably the third one: "articles are not supposed to be reference textbooks". I will happily concede my positions whenever I get how they break them. (I'm unable to find what you're referencing here, but what I'm arguing for shouldn't be in conflict with it, at least not with what kind of idea I assume the phrase is getting at)
"How multiple code points can be combined into a single glyph" has nothing to do with UTF-8. I wrote about this at #Other issues in the article above.
Combining differing numbers of bytes into single code points, on the other hand, is the defining characteristic of a variable-length character encoding, such as UTF-8 and its "cousins", like Shift JIS and GBK. (The links go to the respective code page layout-equivalents on the articles.)
I have read the full article, as I said here when talking about how the code page has been very useful for me personally; "I have only read some of the other text on the article previously, never in full before and I agree the article has had problems with bloat." (Emphasis added, I didn't catch how ambiguous this was when proofreading!)
I think one of the best things about such a table/picture is how it helps you build a mental map in order to get a better understanding about what you're reading: It's essentially the "picture" of the thing, what it logically looks like. Especially with colour (or some other way to subconsciously differentiate sections), it's a powerful way to visually identify and to "map" it in the brain for better understanding. This leverages the fact that visual recognition is the single strongest way for humans to match patterns and receive data. This process is largely automatic, and thus requires very little effort in comparison to constructing the "map" from scratch by reading rules about the subject. "A picture is worth a thousand words" etc. etc. This is more true the more complicated a subject is. I compared this to using diagrams on articles about mathematical concepts in the #cylinder example.
Some topics benefit greatly from such additional illustration and I believe this is one of those cases. I think that articles like this SHOULD at least show the corresponding code page, as it efficiently and intuitively summarizes the encoding. As I wrote above at "#Precedents/examples in other articles", it looks like all similar articles about 8-bit character encodings (where such a table is small) have an equivalent table or picture.
I previously thought it was neat how UTF-8's table had additional information sprinkled in (like the hover-over Unicode ranges per start byte), but I can see how this is just extra clutter. Shift_JIS#Shift_JIS_byte_map is very clean in comparison, only listing the actual code points as text.
About the code page being "useless for any purpose other than implementing handling of the format"; I think this is almost the other way around. In comparison to reading about a topic, when programming something I want the written details/rules instead. A picture can also help, but mainly because it helps me understand the thing itself better in general, just like when just reading about it for my own sake.
I currently interpret the rationale for edits A and B as

Since these poorly laid-out sections contain both internal and external repetition, and are not even close to proper essay form, they should be removed wholesale to make it more inviting for someone to later write about these matters, including some of the points from these sections. As things stand, virtually no one would even attempt that, because it would always end up duplicating these sections, and gradually removing parts from such a consolidated and interdependent body of material is virtually impossible.

I agree with that in general. It's just that I found the approach almost irresponsibly heavy-handed.
I think the main disagreement here is whether an appropriate article should include technically redundant illustrations or examples (deducible with conscious effort) when the rules are already explained in pure prose. I think a small number of pertinent examples and clarifying illustrations can greatly enhance the readability and ease of understanding of topics like this: both to prepare readers previously unfamiliar with the topic to absorb the details, and to give returning readers a quick refresher, drastically reducing the need to re-read much of the text. In addition, I'd wager most readers don't read full articles (or even paragraphs), but instead skim for what they're after, and illustrations and examples are precisely that kind of "gold nugget": dense yet easily digestible information. (When time is of the essence, I definitely do this to "wring the information out", and these things help a lot.)
I don't think every guideline about what the ideal article should look like is meant to be followed as strictly as technically possible, with the resulting prototype applied 1:1 to every article to harshly cull the inharmonious parts.
In regards to colours... (merged to #Helper colours above.) Mossymountain (talk) 06:30, 22 September 2024 (UTC)

== Unicode no. of characters wrong ==

Unicode has 1,111,412 characters. Please make this change. FrierMAnaro (talk) 14:17, 31 October 2024 (UTC)

0x110000 is 1,114,112, but the number shown is after subtracting the 2,048 surrogate halves. (I disagree, but the consensus was that they should not count.) Spitzak (talk) 17:58, 31 October 2024 (UTC)
Indeed, the Unicode Standard explicitly states it contains 1,114,112 code points right in its introduction, but there are far fewer characters. We're just quite loose in distinguishing between code points, characters, Unicode scalar values, and not-well-defined ad-hoc phrases like "valid Unicode code points" as currently used in the second paragraph of the article. UTF-8 does not encode "code points" or "characters" but "Unicode scalar values" (D76), of which there are 1,112,064. Not all are assigned to characters yet; some are explicitly designated noncharacters. UTF encodings can encode them all, but there are no well-formed sequences of code units that would represent surrogate code points. The wording is grossly imprecise, but the numbers are correct. – MwGamera (talk) 23:10, 31 October 2024 (UTC)
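For anyone checking the figures, the arithmetic is simple enough to sketch (Python; the ranges follow the Unicode Standard definitions cited above):

<syntaxhighlight lang="python">
code_points   = 0x110000          # U+0000..U+10FFFF: all Unicode code points
surrogates    = 0xE000 - 0xD800   # U+D800..U+DFFF: surrogate code points
scalar_values = code_points - surrogates   # what UTF-8 can actually encode
print(code_points, surrogates, scalar_values)  # 1114112 2048 1112064
</syntaxhighlight>

which matches the 1,114,112 code points and 1,112,064 scalar values given above.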
I changed it to say "Unicode scalar values" and added a citation of the Unicode 16.0.0 standard to the reference for the number. Guy Harris (talk) 21:53, 1 November 2024 (UTC)
Surrogate halves are "code points", but they are not themselves individually "characters" in the most common meaning of the term. They're elements which can be used in pairs to encode characters. AnonMoos (talk) 18:17, 1 November 2024 (UTC)

== Tooltips for code points ==

Can you add a tooltip? Add a tooltip to every cell of the table showing the range of code points the byte can encode. Also add tooltips for code points beyond U+10FFFF. FrierMAnaro (talk) 07:14, 17 November 2024 (UTC)
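If such tooltips were generated rather than written by hand, their contents could be computed along these lines (a rough Python sketch of my own; lead_byte_range is a hypothetical helper, and the ranges are nominal, ignoring RFC 3629's overlong/surrogate/U+10FFFF restrictions):

<syntaxhighlight lang="python">
def lead_byte_range(b):
    """Nominal code-point range introduced by UTF-8 lead byte b, or None."""
    if b < 0x80:                   # 1-byte sequence (ASCII)
        return (b, b)
    if 0xC0 <= b <= 0xDF:          # 2-byte sequence: 5 payload bits, then 6
        lo = (b & 0x1F) << 6
        return (lo, lo + (1 << 6) - 1)
    if 0xE0 <= b <= 0xEF:          # 3-byte sequence: 4 payload bits, then 12
        lo = (b & 0x0F) << 12
        return (lo, lo + (1 << 12) - 1)
    if 0xF0 <= b <= 0xF7:          # 4-byte sequence: 3 payload bits, then 18
        lo = (b & 0x07) << 18
        return (lo, lo + (1 << 18) - 1)
    return None                    # continuation byte or invalid lead byte

lo, hi = lead_byte_range(0xE2)
print(f"0xE2 -> U+{lo:04X}..U+{hi:04X}")  # 0xE2 -> U+2000..U+2FFF
</syntaxhighlight>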

== Alternative conversion table ==

I have always found the conversion table a little confusing, so I made a simpler alternative.

https://x.com/LatinSuD/status/1869138590271488375/photo/1

If you like it, I (or somebody) could try to complete it and perhaps convert it to SVG? LatinSuD (talk) 22:09, 17 December 2024 (UTC)

We do not recommend additional media rendered as images for what should really be text. Remsense ‥  22:28, 17 December 2024 (UTC)
Looks kind of nice, but there is a desire to keep the table resembling the references, which just use text. Spitzak (talk) 00:30, 18 December 2024 (UTC)