Talk:UTF-8: Difference between revisions

Content deleted Content added

Inline

Latest revision as of 18:36, 21 December 2024

This is the talk page for discussing improvements to the UTF-8 article.
This is not a forum for general discussion of the article's subject.

Put new text under old text. Click here to start a new topic.
New to Wikipedia? Welcome! Learn to edit; get help.

Article policies

Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL

Computing Mid‑importance

	This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.ComputingWikipedia:WikiProject ComputingTemplate:WikiProject ComputingComputing
Mid	This article has been rated as Mid-importance on the project's importance scale.

Computer science Mid‑importance

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.Computer scienceWikipedia:WikiProject Computer scienceTemplate:WikiProject Computer scienceComputer science

Mid

This article has been rated as Mid-importance on the project's importance scale.

Things you can help WikiProject Computer science with:

Here are some tasks awaiting attention:

Article requests :
- Requested articles/Applied arts and sciences/Computer science, computing, and Internet
Cleanup :
- Computer science articles needing attention
- Computer science articles needing expert attention
Copyedit :
- Computing
Expand :
- Computer science
Infobox :
- Computer science articles without infoboxes
Maintain :
- Timeline of computing 2020–present
Photo :
- Find pictures for the biographies of computer scientists (see List of computer scientists)
- Computing articles needing images
Stubs :
- Computer science stubs
Unreferenced :
- WikiProject Computer science/Unreferenced BLPs
Project-related :
- Tag all relevant articles in Category:Computer science and sub-categories with {{WikiProject Computer science}}

Typography Mid‑importance

	This article is within the scope of WikiProject Typography, a collaborative effort to improve the coverage of articles related to Typography on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.TypographyWikipedia:WikiProject TypographyTemplate:WikiProject TypographyTypography
Mid	This article has been rated as Mid-importance on the importance scale.

Archives

Index

Archive 1	Archive 2	Archive 3
Archive 4	Archive 5

This page has archives. Sections older than 730 days may be automatically archived by .

Table should not only use color to encode information (but formatting like bold and underline)

As in a previous comment https://en.wikipedia.org/wiki/Talk:UTF-8/Archive_1#Colour_in_example_table? this has been done before, and is *better* so that everyone can clearly see the different part of the code. Relying on color alone is not good, due to color vision deficiencies and varying color rendition on devices. — Preceding unsigned comment added by 88.219.179.109 (talk • contribs) 02:26, 17 April 2020‎ (UTC)[reply]

Microsoft script dead link

   and Microsoft has a script for Windows 10, to enable it by default for its program Microsoft Notepad

   "Script How to set default encoding to UTF-8 for notepad by PowerShell". gallery.technet.microsoft.com. Retrieved 2018-01-30.

   https://gallery.technet.microsoft.com/scriptcenter/How-to-set-default-2d9669ae?ranMID=24542&ranEAID=TnL5HPStwNw&ranSiteID=TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w&tduid=(1f29517b2ebdfe80772bf649d4c144b1)(256380)(2459594)(TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w)()

This link is dead. How to fix it? — Preceding unsigned comment added by Un1Gfn (talk • contribs) 02:58, 5 April 2021 (UTC)[reply]

That text, and that link, appears to have been removed, so there's no longer anything to fix. Guy Harris (talk) 23:43, 21 December 2023 (UTC)[reply]

The article contains "{{efn", which looks like a mistake.

I would've fixed it myself but I don't know how to transform the remaining sentence to make sense. 2A01:C23:8D8D:BF00:C070:85C1:B1B8:4094 (talk) 16:17, 2 April 2024 (UTC)[reply]

I fixed it, I think. I'm not 100% sure it's how the previous editors intended. I invite them to review and confirm. Indefatigable (talk) 19:03, 2 April 2024 (UTC)[reply]

Should "The Manifesto" be mentioned somewhere?

More specifically, this one: https://utf8everywhere.org -- Preceding unsigned comment added by Rudxain (talk o contribs) 21:52, 12 July 2024 (UTC)[reply]

Only if it's got significant coverage in reliable sources. Remsense 22:10, 12 July 2024 (UTC)[reply]

It's kind of ahistorical, since the Microsoft decisions that they deplore were made while developing Windows NT 3.1, and UTF-8 wasn't even a standard until Windows NT 3.1 was close to being released. There was more money to be made from East Asian customized computer systems than Unicode computer systems in 1993, so Unicode was probably not their main focus at that time... AnonMoos (talk) 20:30, 15 July 2024 (UTC)[reply]

The number of 3 byte encodings is incorrect

This sentence is incorrect:

Three bytes are needed for the remaining 61,440 codepoints...

FFFF - 0800 + 1 = F800 = 63,488 three byte codepoints.

The other calculations for 1, 2, and 4 byte encodings are correct. Bantling66 (talk) 02:56, 23 August 2024 (UTC)[reply]

You forgot to subtract 2048 surrogates in the D800–DFFF range. – MwGamera (talk) 08:58, 23 August 2024 (UTC)[reply]

Multi-point flags

I'm struggling to assume good faith here with this edit. A flag which consists of five code points is already sufficiently illustrative of the issue being discussed. That an editor saw fit to first remove that example without discussion, and then to swap it out for the other example when it was pared down to one flag, invites discussion of why that particular flag was removed, and the obvious answer isn't a charitable one. Chris Cunningham (user:thumperward) (talk) 12:35, 17 September 2024 (UTC)[reply]

Yes it was restored to the pride flag for precisely the reasons you state. Spitzak (talk) 20:48, 17 September 2024 (UTC)[reply]

A better, more in-depth explanations of the flags can be found on the articles regional indicator symbol and Tags_(Unicode_block)#Current_use (the mechanism for these specific flags). I don't think it belongs in articles of specific character encodings like UTF-8 at all.

The fact that one code point does not necessarily produce one grapheme has nothing to do with a specific character encoding like UTF-8. It's a more fundamental property of the text itself and any encoding that can be used to encode some string of characters decodes back to the same characters when decoded back from the binary representation. Although very popular, UTF-8 is just one of the numerous ways to encode text to binary and back.

I wrote more about this below at Other issues in the article and sadly only then noticed this was already being somewhat discussed here. Mossymountain (talk) 10:45, 20 September 2024 (UTC)[reply]

Why was the "heart" of the article, almost the whole section of UTF-8#Encoding (Old revision) removed instead of adding a note?

NOTE: The section seems to have been renamed (UTF-8#Encoding -> UTF-8#Description) in this edit.

I don't understand why such a large part of UTF-8#Encoding (old revision) was suddenly removed in this edit (edit A), and then this edit (edit B) (diff after both edits) instead of either:

Adding a note about parts of it being written poorly.
Rewriting some of it. (the best and the most difficult option)
Carefully considering removing parts that were definitely redundant (such as arguably the latter part of UTF-8#Examples (old revision)).

Both of the edits removed a separate, and quite a well-written example (at least for my brain, these very examples made understanding UTF-8 require significantly less effort spent thinking). I don't think removing them was a good decision. Yes, you could explain basically anything without using examples, but in my experience an example is usually the easiest and fastest way for someone to understand almost any concept, especially when the examples were so visual and beautifully simple. I see it in the same category as a lecturer speaking with his hands and writing+drawing relevant things on a whiteboard versus having to hold the lecture by speaking over the phone.

The 1st, edit A

→‎Encoding: this entire section is almost completely opaque and its inclusion stymies the addition of some clear prose describing how unicode is decoded
— user:Thumperward, (edit A)

To me, this reads as if UTF-8 was accidentally conflated with Unicode, causing a mistake to remove the parts from the wrong article (Having thought about it more, I now think it's) a severe disagreement of article design/presentation style.
(I still think edit notes asking for rewrites would have been the way to go instead of nuking the information, and that for some of the items, an article-like rewrite would be the wrong choice: Some data is way more enjoyable and simple to read visually from a table than it is to glean from written or spoken word and, as such, should be visualized in a table.)

I am strongly of the mind that the deleted parts included the two most important parts of the whole article, that must definitely be included as they are the very core of the article:

The UTF-8#Codepage layout (old revision), in my opinion the most important part of any article about a character encoding. This part was in my opinion also designed, formatted and written exemplarily well here. The colour palette could be adjusted accordingly if it's a problem for the colour-blind.
- Precedents/Examples in other articles about specific character encodings:
- Variable multi-byte ("cousins of UTF-8"):
  - Shift_JIS#Shift_JIS_byte_map
  - GBK_(character_encoding)#Encoding
- single byte:
  - ASCII#Character_set (a strict subset of UTF-8)
  - Code_page_437#Character_set
  - ISO/IEC_8859-1#Code_page_layout
The first list (numbered 1..7) of UTF-8#Examples (old revision) that clearly, by a singular simple example demonstrates how UTF-8 works. (I agree it could be rewritten, the language used is quite verbose)

Sweeping the less important items under these rugs to make this seem shorter:

The 2nd, edit B

→Encoding: this now refers to removed text and contradicts repeated assertions elsewhere that overlong encodings are unnecessary
— user:Thumperward, (edit B)

This edit removed the whole section UTF-8#Overlong encodings (old revision). I disagree with its removal.

The example removed in this edit was a clear and easy to understand way of explaining what an overlong encoding means.
I don't understand what the deleted text is referred to have contradicted, unless this is something like the mention in UTF-8#Implementations and adoption of Java's "Modified UTF-8" that uses an overlong encoding for the null character. Overlong encodings aren't merely "unnecessary", they are *utterly forbidden*/invalid/illegal.
- Apart from the lacking citation, which probably should have been rfc3629 § 3, I don't understand what was wrong with the second paragraph. I also consider the information presented in it essential for the article. (A simple decoder implementation could easily just pass the overlong encodings as if they were single-byte characters, or choose to simplify encoding by using a fixed length. The paragraph gives two good reasons why such encodings are illegal, that are now completely gone from the article.)

About removing helper colours and edit C

The 3rd, edit C

This is about font colouring on UTF-8#Encoding (old version), it reverts this edit by User:Nsmeds. The textual information stays the same between the two, the edit only removes the custom colours.

I would prefer some form of colouring to be added back.
Properly selected helper colours shouldn't be against anything:
I don't think {{colorblind|section}} or Wikipedia:Manual_of_Style/Accessibility#Color are at all suggesting the wiping of non-essential helper colours when they could only be potentially hard to distinguish from each other. What is definitely suggested instead is fixing situations with colouring that can make the text hard to read (colouring that can be assumed to potentially lead to a low contrast between the text and its background for any reader).

→Encoding: fix the colour blindness issue
— user:Thumperward, (edit C)

This is attempting to fix a potential issue for the colour-blind, but I think it unfortunately only ends up denying the help the colour was there to provide from both the colour-blind and not. The colours were NEVER the primary way to convey any data, but an additional help to make the parsing of the information faster and less straining to the eye (removing the need to count anything, you don't need to know that a hex digit covers 4 bits, or that the 0x7 on the left column corresponds to the first xxx on the right, and whether you do or don't, you just instantly see the relationship without thinking. This is obviously highly desirable in data visualization.

Even without doing anything to suboptimal colours, when they are only potentially hard to distinguish from each other instead of the background, the remaining distinguishable groups still serve the original purpose, only with some of it missing or hard to see. The monochrome version ends up being strictly worse.

Another way is to replace the straight x's with different symbols and have the key indicated on the ranges somehow, a mock-up:
U+0080(xyz) .. U+07FF(xyz) | 110xxxyy 10yyzzzz (hex digit resolution)
U+0080(xyy) .. U+07FF(xyy) | 110xxxyy 10yyyyyy (byte resolution) and this can be in addition to colouring that doesn't sacrifice contrast for anyone.

I just tried something like that in these edits. It's not ideal, especially how it makes the sentence before it quite unpleasant to read.

I think these should be considered before removing colour outright:

Do the colours used here even have a problem with contrast with the background, (or only amongst themselves and they are not providing information)? Maybe it's just that we should avoid the potential low-contrast combinations even for those with normal vision, such as:
- Overly bright colours, such as bright yellow (after switching to light background, I really struggle to read "bright yellow" there)
- Overly dark colours, such as deep blue (after switching to dark background, I struggle to read "deep blue" there)
- Colours close to even the rest of the corresponding brightnesses between the light and dark mode background and their respective overlay backgrounds like this one of <code>

I think the least total effort catch-all long-term solution would be to provide a site-wide toggle on the side that overrides all text and background colouring when you want, probably makes sense beside the existing "Light" and "Dark" mode toggles, to force foreground elements close to the opposite end.

The other solution to fix all of what edit C attempted to fix, (and the solution applicable right here and now) would be to use a palette that is also readable for the colour blind, such as these three palettes found on Color_blindness#Ordered_Information that can be used to produce distinct colours that work no matter of colour-blindness.

NOTE: They ALL work for ALL types of colour blindness, it's just a choice of which one looks the nicest.
Do keep in mind however that all of the selected colours still need to have good contrast from both light and dark backgrounds, so maybe the colours from the very edges of these aren't usable, like how I attempted to demonstrate above with blue and yellow.

Other issues in the article (solved)

The UTF-8 article does talk about generic things about Unicode quite a bit more than I think it should, such as explaining how some "graphical characters can be more than 4 bytes in UTF-8". This is because Unicode (and by extension UTF-8) does not deal in graphemes in the first place, but code points (essentially just numbers to index into Unicode), which can correspond to valid Unicode characters, which in turn can directly correspond to a grapheme. Some characters don't correspond to a grapheme at all (control characters), such as the formatting tag characters used in the flag example, and some combine/join with other character(s) to to produce a combination grapheme (combining/joining characters).
The possibility of needing to use multiple code points for one grapheme like that is a direct consequence of these types of characters in general and isn't caused by UTF-8 or any other encoding, and can happen through ANY and all encodings capable of encoding such code points, not just UTF-8.
In short: The issue has nothing to do with UTF-8.

Mossymountain (talk) 05:09, 20 September 2024 (UTC) Mossymountain (talk) 17:10, 20 September 2024 (UTC)[reply]

Because the editor was offended that that section used color. Akeosnhaoe (talk) 08:56, 20 September 2024 (UTC)[reply]

It's pretty important that we not communicate information solely through color, but I wonder how we could better do something like that. Remsense ‥ 论 09:02, 20 September 2024 (UTC)[reply]

Most of the information wasn't in the color, it was in the text readable without formatting in monochrome. The color was there just to make it easier to quickly identify which is which.
If what Akeosnhaoe said is the case (which I don't think it is, I think this was an honest misunderstanding with good intentions), obviously the colors should be changed to the intended visibility standard, not the information removed. Mossymountain (talk) 10:17, 20 September 2024 (UTC)[reply]

IMHO the edits made by user:thumperward were a good and powerful attempt to remove the obscene bloat of this article. The enourmous complex "examples" with color did not provide any information, and it is quite impossible to figure out what the colors mean without already knowing how UTF-8 works already. Elimination of the "code page" is IMHO a good and daring decision, one I may not have made and I'm glad he tried it. I'd like to continue, pretty much removing the bloated mess of "comparisons" that are either obvious or that nobody cares about, the few useful bits of info there can be merged into the description.Spitzak (talk) 18:13, 20 September 2024 (UTC)[reply]

My most important point, by far, is that I vehemently disagree with the removal of the code page.

It is the single thing with the most useful information packed on the article and irreplaceable in utility. I don't understand what was wrong with it at all. I see its removal as the same kind of hindrance as deleting all of the drawings that visualize what measurements the letters h, r, d represent on a cylinder from that article.
This makes it a lecture where the professor can only attend by talking over the phone. No gestures, no diagrams, nothing. It "does still work", it's just requires more effort from the students (and from the professor, but that's a one-time cost here) Yes, you technically still can glean all of the same information by reading through the article and spending effort to understand what you read, but it would outright DENY the use case where one just looks at a picture or two for a couple of seconds and is already able to close the article, while hindering the rest of the readers by not providing the still useful clarification as study aides.

I'm firmly in the camp that believes that for virtually all human readers, some well thought out visualizations illustrating some concept's defining characteristics only help in understanding, they are the best way to essentially convey "what something looks like", be it logically (like in this case) or physically. I personally have visited the UTF-8 page specifically for the code page for years whenever I needed a refresher when dealing with the encoding. Sure, I could have dug up a cumbersome specification and ^F'd through it to achieve the same thing in at least double the time, but the article was easily the best resource I've found on the internet for understanding UTF-8, largely thanks to how well the code page was thought out and put together.

I have only read some of the other text on the article previously, never in full before and I agree the article has had problems with bloat. In my mind this still does not mean the most useful thing should be removed in favour of briefness (it's essentially just a picture/diagram, but one that you can interact with to get more out of. The readers can easily identify that rough class of thing and skip it when they don't want to inspect it. It's very obviously not part of the text you're supposed to read out loud for example.) Mossymountain (talk) 05:36, 21 September 2024 (UTC)[reply]

I'm not relitigating basic, universally-understood concepts such as "articles should not be hundreds of kilobytes long", "articles should not use colours to convey important information" or "articles are not supposed to be reference textbooks". These are simply settled consensus. The code page table is absolutely useless for any purpose other than implementing handling of the format, which is categorically not the point of an encyclopedia article. What this article should do is explain where UTF-8 fits into the world, how it has been adopted, and how at some basic level it works. Precisely what any given sequence of bytes happens to stand for (other than in explaining how the byte sequence informs multi-byte code points) is not pertinent, especially because the lowest seven bytes were very deliberately copied from ASCII anyway.

Frankly, the major thing I gleaned from the above wall of text (and that on my talk page) is that the editor posting it hasn't actually read the article very closely. A lot of the trimming down that was performed on the text was precisely because the article should put more emphasis on UTF-8's unique features, primarily its variable-length encoding and how multiple code points can be combined into a single glyph. I argued against the (seemingly political) removal of some of that detail in the previous section of this talk page, so it makes no sense to argue that this has somehow been de-emphasised by the removal of unrelated trivia.

This article still needs a lot of work. What it does not need is the re-addition of huge, heavy blocks of content of absolutely no value outside of a reference textbook. Chris Cunningham (user:thumperward) (talk) 11:03, 21 September 2024 (UTC)[reply]

I am not arguing those points. At least I don't think I am. The closest one is probably the third one: "articles are not supposed to be reference textbooks". I will happily concede my positions whenever I get how they break them. (I'm unable to find what you're referencing here, but what I'm arguing for shouldn't be in conflict with it, at least not with what kind of idea I assume the phrase is getting at)

I wrote about this at #Other issues in the article above.

Combining differing amounts of bytes to single code points on the other hand is the defining characteristic of a variable length character encoding, such as UTF-8 and its "cousins", like Shift JIS and GBK. (The links go to the respective code page layout-equivalents on the articles.)

I have read the full article, as I said here when talking about how ; "I have only read some of the other text on the article previously, never in full before and I agree the article has had problems with bloat." (Emphasis added, I didn't catch how ambiguous this was when proofreading!)

I think one of the best things about such a table/picture is how it helps you build a mental map in order to get a better understanding about what you're reading: It's essentially the "picture" of the thing, what it logically looks like. Especially with colour (or some other way to subconsciously differentiate sections), it's a powerful way to visually identify and to "map" it in the brain for better understanding. This leverages the fact that visual recognition is the single strongest way for humans to match patterns and receive data. This process is largely automatic, and thus requires very little effort in comparison to constructing the "map" from scratch by reading rules about the subject. "A picture is worth a thousand words" etc. etc. This is more true the more complicated a subject is. I compared this to using diagrams on articles about mathematical concepts in the #cylinder example.

Some topics benefit greatly from such additional illustration and I believe this is one of those cases. I think that articles like this SHOULD at least show the corresponding code page, as it efficiently and intuitively summarizes the encoding. As I wrote above at "#Precedents/examples in other articles", it looks like all similar articles about 8-bit(== such a table is small) character encodings have an equivalent table or picture.

I previously thought it was neat how UTF-8's table had additional information sprinkled in (like the hover-over Unicode ranges per start byte), but I can see how this is just extra clutter. Shift_JIS#Shift_JIS_byte_map is very clean in comparison, only listing the actual code points as text.

About the code page being "useless for any purpose other than implementing handling of the format"; I think this is almost the other way around. In comparison to reading about a topic, when programming something I want the written details/rules instead. A picture can also help, but mainly because it helps me understand the thing itself better in general, just like when just reading about it for my own sake.

I currently interpret the rationale for edits A and B as

Since these poorly laid out sections have both internal and external repetition, while not even close to proper essay form, it should all be removed in order to make it more inviting for someone to later write about things including some of the points from these sections. Currently, virtually no one would probably even attempt to do that because it would always end up repeating these sections, and gradually removing parts from such a consolidated an interdependent form of data is virtually impossible.

I agree with that in general. It's just that I found the approach almost irresponsibly heavy-handed.

I think the main disagreement here is an appropriate article should include technically redundant (able to be deduced when consciously spending effort to) illustrations or examples, when the rules are already explained in pure writing. I think a tiny number of pertinent Both to help make readers previously unfamiliar with topic ready to accept the details and to give returning ones a quick refresher, drastically reducing the need to read much of the text itself again. In addition to that, I'd wager most readers don't always read full articles (or even paragraphs), but instead try to skim through to find something they're after and illustrations and examples are precisely those kind of "gold nuggets"; dense, yet easily digestible information. (When time is of the essence, I definitely do this in order to "wring the information out" and these kind of things help a lot.)

I don't think every guideline about what the ideal article should look like is supposed to be followed as strictly as technically possible and the resulting prototype applied 1:1 on every article to harshly cull the inharmonious parts off.

~~In regards to colours...~~ (merged to #Helper colours above.) Mossymountain (talk) 06:30, 22 September 2024 (UTC)[reply]

Unicode no. of characters wrong

Unicode has 1,111,412 characters. Please make this change. FrierMAnaro (talk) 14:17, 31 October 2024 (UTC)[reply]

0x110000 is 1,114,112, but the number shown is after subtracting the 2048 surrogate halves (I disagree but the consensus was that they should not count) Spitzak (talk) 17:58, 31 October 2024 (UTC)[reply]

Indeed, the Unicode Standard explicitly states it contains 1,114,112 code points right in its introduction, but there are much fewer characters. We're just quite loose in distinguishing between code points, characters, Unicode scalar values, and not well-defined ad-hoc phrases like valid Unicode code points as currently used in the second paragraph of the article. UTF-8 does not encode "code points" or "characters" but "Unicode scalar values" (D76). There are 1,112,064 of these. Not all are assigned to characters yet; some are explicitly designated noncharacters. UTF encodings can encode them all, but there are no well-formed sequences of code units that would represent surrogate code points. The wording is grossly imprecise, but the numbers are correct. – MwGamera (talk) 23:10, 31 October 2024 (UTC)[reply]

I changed it to say "Unicode scalar values" and added a citation of the Unicode 16.0.0 standard to the reference for the number. Guy Harris (talk) 21:53, 1 November 2024 (UTC)[reply]

Surrogate halves are "code points", but they are not themselves individually "characters" in the most common meaning of the term. They're elements which can be used in pairs to encode characters. AnonMoos (talk) 18:17, 1 November 2024 (UTC)[reply]

Tooltips for code points

Can you add a tooltip? Add a tooltip to every cell of the table which shows the range of code points the byte can encode. Also add tooltips for characters beyond the 10FFFF. FrierMAnaro (talk) 07:14, 17 November 2024 (UTC)[reply]

Alternative conversion table

I have always found the conversion table a little confusing, so I made a more simple alternative.

https://x.com/LatinSuD/status/1869138590271488375/photo/1

If you like it, I (or somebody) could try to complete it and convert to SVG maybe? LatinSuD (talk) 22:09, 17 December 2024 (UTC)[reply]

We do not recommend additional media rendered as images what should really be text. Remsense ‥ 论 22:28, 17 December 2024 (UTC)[reply]

Looks kind of nice, but there is a desire to keep the table resembling the references, which just use text. Spitzak (talk) 00:30, 18 December 2024 (UTC)[reply]