Talk:Comparison of Unicode encodings

From Wikipedia, the free encyclopedia

This is the current revision of this page, as edited by Spitzak (talk | contribs) at 01:51, 12 June 2024 (Processing time misconception). The present address (URL) is a permanent link to this version.

CJK characters use 3 bytes in UTF-8?

This article states that "...There are a few, fairly rarely used codes that UTF-8 requires three bytes whereas UTF-16 requires only two..."; but it seems to me that most CJK characters take 3 bytes in UTF-8 but only 2 bytes in UTF-16? 76.126.165.196 (talk) 08:32, 25 February 2008 (UTC)

I think you are right. And western people generally don't care about that... --96.44.173.118 (talk) 16:20, 13 December 2011 (UTC)

This does sound poorly worded. However, it should also be made clear that real CJK text in actual use on computers usually contains so much ASCII (numbers, spaces, newlines, XML markup, quoted English, etc.) that it is *still* shorter in UTF-8 than in UTF-16. In addition, it should be pointed out that most CJK characters are entire words equivalent to 3-7 characters in English, and thus they already have a huge compression advantage.
Some alphabetic languages from India do have a 3-byte UTF-8 encoding of all their letters. Since their words consist of multiple characters, they can end up bigger in UTF-8 than in UTF-16, and there have been complaints about this. Any comment about length should mention these languages, where it actually is a problem. Spitzak (talk) 21:13, 13 December 2011 (UTC)
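The size claims in this thread are easy to check. The following is only a rough Python sketch: the sample strings are made up for illustration (they are not taken from the article or from any measured corpus), and `encoded_sizes` is a hypothetical helper name.

```python
# Illustrative sample strings chosen for this sketch; they are assumptions,
# not data from the discussion above.
samples = {
    "CJK only": "統一碼編碼比較",
    "CJK in markup": '<p class="note">統一碼</p>\n',
    "Devanagari": "यूनिकोड",
}


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

With samples like these, pure CJK or Devanagari text comes out larger in UTF-8, while the markup-heavy line comes out smaller, which is the distinction the comments above are drawing. Real documents will vary.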

Requested move

This article appears to be a sub-page of Unicode, which is OK; but it should have an encyclopedic name that reflects its importance (that of an article on Unicode encodings, rather than some evaluative comparison). --donhalcon 16:26, 7 March 2006 (UTC)

It should be moved to Unicode encodings. Once that's done, the opening sentences should be redone to inform readers of the basic who/what/why. --Apantomimehorse 10:11, 10 July 2006 (UTC)

UTF-24?

Hex 110000, the grand total for 17 planes, obviously takes 21 bits, which fit comfortably into 3 bytes (24 bits). So why would anyone want to encode 21 bits in 32 bits? The fourth byte is entirely redundant. What, then, is the rationale behind having UTF-32 instead of "UTF-24"? Just a superstitious fear of odd numbers of bytes? dab () 12:47, 6 July 2006 (UTC)

It's more than superstitious fear of odd numbers of bytes: it is a fact that most computer architectures can process data in multiples of their word size more quickly. Most modern computers use either a 32-bit or a 64-bit word. On the other hand, modern computers are fast enough that the speed difference is irrelevant. It is also true that most computer languages provide easy ways to refer to those multiples. (For example, in C on a 32-bit machine, you can treat UTF-32 in the machine's native byte order as an array of integers.) --LeBleu 23:01, 7 July 2006 (UTC)
Why not ask why we don't have UTF-21, since the last three bits in UTF-24 would be entirely redundant? Same issue, basically, but on a different scale (the hypothetical UTF-21, if actually stored as 21-bit sequences, would be much slower to process without noticeable size gain). Word sizes tend to be powers of two, so if data can be presented as (half)-word sized at little extra cost, this will be done unless there are overriding reasons of space economy. And if you want space economy, you should use UTF-16 anyway, since the extra processing power you must pay for characters outside the BMP is (usually) not significant enough to warrant using twice as much storage.
Nothing actually prohibits you from layering another encoding over UTF-32 that stores the values in three bytes, as long as you supply the redundant byte to anything that advertises itself as processing UTF-32. This is unlikely to be of much advantage, though. 194.151.6.67 11:36, 10 July 2006 (UTC)
So the fourth byte is really redundant, and hangs around in memory for faster processing speed. I imagine that all UTF-32 files will have to be compressed as soon as they are stored anywhere; the question then is, which is more of a waste of processing power: compressing and uncompressing the files, or adding a zero byte at read time before further processing? UTF-8 is only economical if the overwhelming majority of characters are in the low planes. Assume (for argument's sake) a text with characters evenly distributed in the 17 planes: UTF-8 would be out of the window, but "UTF-24" might have an advantage over UTF-32 (obviously "UTF-21" would be even more economical, but that would really mean a lot of bit-shifting). dab () 18:14, 29 July 2006 (UTC)
To answer your direct question: adding an extra byte after every 3 will take far, far less processing than implementing something like deflate. Having said that, I can't see many situations where you would do it.
Text with characters evenly distributed among the planes is going to be very, very rare. Only 4 planes have ever had any allocations at all (BMP, SMP, SIP and SSP), only two of those contain character ranges for complete scripts (the other two are rare CJK ideographs and special control codes), and most texts will be highly concentrated in a few small ranges.
If you are concerned with storage space and you are dealing with a lot of non-BMP characters in your text (say, an archive of Tolkien's Tengwar and Cirth manuscripts), then you will have to choose between possibilities such as a custom encoding, compressing encodings like SCSU and BOCU, and general-purpose compression algorithms like deflate. With most systems, however, even if individual documents are non-BMP, the overwhelming majority of characters in the system as a whole are in the BMP.
A final point: if heavy use is made of HTML, XML or similar markup languages for formatting, the ASCII characters of the markup can easily far outnumber the characters of the actual document text. Plugwash 23:06, 29 July 2006 (UTC)
My 2¢ regarding the possibility of UTF-24. I don't think the reasons given so far hold much water; they tend simply to justify the situation as it exists now, rather than recognizing Unicode as an evolving standard. Before considering the need for UTF-24, it's important to consider the way Unicode actively seeks to assign code points. Since Unicode tries to assign all graphemes from actively used languages in the BMP, UTF-16 surrogate characters are seldom needed for most documents except in certain esoteric circumstances. That means that in most cases character count equals bytes ÷ 2. It also means most documents have the bulk of their content expressed without surrogate characters in UTF-16, or with only 1-3 bytes per character in UTF-8 (since UTF-8 only turns to 4 bytes for characters outside the BMP). Therefore, due to the way Unicode assigns graphemes to code points, the bulk of most documents will be characters from the BMP, drawing only occasionally on non-BMP characters.
UTF-32 has the advantage of always making it very quick and easy to count characters (bytes ÷ 4). However, UTF-32 also leads to much larger memory use (2 to 4 times as much, depending on the text processed). The nice thing about UTF-24 is that it would provide the same speed benefits while using a quarter less memory than UTF-32 (at least for text outside the BMP). In many ways I think that UTF-24 offers little over UTF-32 for internal processing of text. However, for storage of text outside the BMP (largely academic documents in ancient scripts or using seldom-used characters), UTF-24 could provide a valuable space-saving transformation format for Unicode characters. Especially for files containing music characters, ancient writing, and perhaps even academic CJK writing, UTF-24 could conserve disk space (though, as Plugwash said, things like music manuscripts will likely have a large proportion of Latin-1/ASCII block characters, where UTF-8 might conserve as much disk space as a hypothetical UTF-24).
One last thing about a fixed-width encoding (which was among the original goals of Unicode that didn't really take hold). The development of Unicode has shown that raw character count is likely less important than grapheme cluster count (and this affects characterAtIndex as well). While this may be often acknowledged, I am not aware of many (any?) implementations whose string count methods/functions count grapheme clusters rather than characters. In fact, I think some implementations return UTF-16 string counts as bytes ÷ 2, so the implementation actually ignores the surrogate problem entirely. With these complications, an implementation really needs to count characters in a different way than it counts bytes to return an accurate grapheme cluster count, counting a surrogate pair as a single grapheme and leaving combining characters out of the count. So UTF-24 doesn't really help with this all that much either (though it does eliminate the surrogate pair issue).
In the end I think UTF-24 could become useful for special-case academic documents. UTF-32 might become more popular for internal processing (as in, not for file formats): since we already see memory usage go from 32 bits to 64 bits for things like pointers and longs, it's not such a stretch to see Unicode implementations go from 16 to 32 bits too. Much of this likely depends on what else gets assigned outside the BMP. --Preceding unsigned comment added by Indexheavy (talk | contribs) 03:59, 4 November 2008 (UTC)
In my experience, the string length functions in languages typically give you a length in code units (bytes for UTF-8, 16-bit words for UTF-16). This is generally what you want, since as far as I can tell the most common uses of string length are either to iterate over the string or to do some sanity checking of the size. If you need the grapheme cluster count, code point count or console position count, you will have to use special functions to get them, but as far as I can tell most applications don't need any of them. Plugwash (talk) 18:53, 7 November 2008 (UTC)
There is UTF-18. It can only represent planes 0, 1, 2 and 14. --Preceding unsigned comment added by PiotrGrochowski000 (talk | contribs) 09:13, 6 March 2015 (UTC)
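To make the "UTF-24" idea in this thread concrete, here is a minimal Python sketch. "UTF-24" is not a standardized encoding; the big-endian three-byte layout and the function names `utf24_encode` and `utf24_decode` are assumptions made purely for illustration.

```python
def utf24_encode(text):
    """Pack each code point into 3 big-endian bytes (hypothetical 'UTF-24')."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


def utf24_decode(data):
    """Inverse of utf24_encode; rejects input whose length is not a multiple of 3."""
    if len(data) % 3:
        raise ValueError("length is not a multiple of 3")
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big")) for i in range(0, len(data), 3)
    )


# One ASCII letter plus one character outside the BMP (U+1D11E, a musical symbol).
text = "A\U0001D11E"
print("UTF-8 :", len(text.encode("utf-8")), "bytes")
print("UTF-16:", len(text.encode("utf-16-le")), "bytes")
print("UTF-24:", len(utf24_encode(text)), "bytes (hypothetical)")
print("UTF-32:", len(text.encode("utf-32-be")), "bytes")
```

The decode step is exactly the "add a zero byte at read time" operation discussed above: widening each three-byte group to a full integer is cheap compared with running a general-purpose compressor.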

UTF-7,5 ?

See this page [1] which describes the encoding. Olivier Mengué 23:19, 22 May 2007 (UTC)

Question?

So what is the most popular encoding? --Preceding unsigned comment added by 212.154.193.78 (talk) 07:52, 15 February 2008 (UTC)

UTF-8 is popular for Latin-based text, while UTF-16 is popular for Asian text. And everyone hates UTF-32 ;-) 88.68.223.62 (talk) 18:32, 27 March 2008 (UTC)
Not really. While UTF-8 is more compact than UTF-16 for most alphabetic scripts and UTF-16 is smaller than UTF-8 for CJK scripts, the decision is often based on considerations other than size (legacy encodings are also commonly used, but we will focus on Unicode encodings here).
In the Unix and web worlds UTF-8 dominates because it is possible to use it with existing ASCII-based software with little to no modification. In the Windows NT, .NET and Java worlds UTF-16 is used because, when those APIs were designed, Unicode was 16-bit fixed width and UTF-16 was the easiest way to retrofit Unicode support. There are one or two things that use UTF-32 (I think Python uses it under certain compile options, and some C compilers make wchar_t 32-bit), but mostly it is regarded as a very wasteful encoding (and the advantage of being fixed width turns out to be mostly an illusion once you implement support for combining characters). Plugwash (talk) 21:43, 4 April 2008 (UTC)
Why not take those considerations into account in an internet/popularity section? --Preceding unsigned comment added by 84.100.195.219 (talk) 20:13, 26 June 2012 (UTC)

Mac OS Reference

This seems to be a bit out of date. I just searched the reference library and cannot come up with anything in the current version of Mac OS regarding UTF-16. Since the cited material is two revisions old (10.3 vs. the current 10.5), and since Mac OS understands UTF-8, the fact that a previous version used UTF-16 for internal system files is irrelevant. I suggest this be removed. Lloydsargent (talk) 14:08, 24 April 2008 (UTC)

UTF-8 with BOM!!!

"A UTF-8 file that contains only ASCII characters is identical to an ASCII file"—Only with the strictest (too strict) reading is this true. A UTF-8 file could have a BOM, which then would not "[contain] only ASCII characters." Can someone re-word this without making it require such a strict reading yet still be simple? —Preceding unsigned comment added by 72.86.168.59 (talk) 18:07, 3 September 2009 (UTC)[reply]

If it has a BOM then it does not consist only of ASCII characters. The BOM must not be required for the file to be handled as UTF-8. This sort of short-sightedness is stopping I18N from being implemented, as too much software cannot handle garbage bytes at the start of the file but would have no problem with these bytes inside the file (such as in a quoted string constant). Spitzak (talk) 21:36, 9 November 2009 (UTC)
If the file doesn't start with a BOM, the encoding cannot be sniffed in a way that is reliable and not too wasteful. I wish we could just always assume UTF-8 by default, but there are too many files in crazy legacy encodings out there. Software that cannot handle (or even tolerate) Unicode will have to go eventually. --88.73.0.195 (talk) 23:26, 20 May 2011 (UTC)
The encoding *can* be sniffed quite well if you assume UTF-8 first, and assume legacy encodings only if it fails the UTF-8 test. For any legacy encoding that uses bytes with the high bit set, the chances of it forming valid UTF-8 are minuscule, like 2% for a 3-character file, and rapidly dropping as the file gets longer. (The only legacy encoding at all popular today that does not use bytes with the high bit set is ASCII, which is already identical to UTF-8.) One reason I really dislike the BOM in UTF-8 files is that it discourages programmers from using this method to determine encoding. The result is that it *discourages* I18N, rather than helping it. Spitzak (talk) 03:09, 21 May 2011 (UTC)
STD 63 = RFC 3629 agrees with you, but UTF-8 processors still must be ready to accept (and ignore) a signature (formerly known as BOM, but it clearly is no BOM in UTF-8). Others, notably the XML and Unicode standards, don't agree with you, and a Wikipedia talk page isn't the place to change standards anyway. Sniffing is not a good option: UTF-8 as well as windows-1252 can be plain ASCII for the first megabytes and end with a line containing ™. Some W3C pages do this (UTF-8 without signature, of course, but still an example of why sniffing is not easy). –89.204.137.230 (talk) 21:14, 10 June 2011 (UTC)
Your example of a file that is plain ASCII for the first megabytes works perfectly with the assumption that it is UTF-8. It will be drawn correctly for all those megabytes. Eventually it will hit that character that is invalid UTF-8, and the software can then decide that the encoding should be something other than UTF-8. Sorry, you have not given an example that does anything other than prove the assumption of UTF-8 is correct. I agree that software should ignore BOM bytes (not just at the start, but embedded in the file), or treat them as whitespace. Spitzak (talk) 01:27, 11 June 2011 (UTC)
Older charsets will go away, but not this year. The IETF policy on charsets predicted 40 years, so it is still more than 25 years until UTF-8 will have gained world dominance. At least all new Internet protocols must support UTF-8. The HTML5 folks will in essence decree that "Latin-1" (ISO 8859-1) is an alias for windows-1252, and the next HTTP RFC also addresses the problem, so for now you can at least assume that Latin-1 is almost always windows-1252. Billions of old Web pages will stay as is until their servers are shut down. –89.204.137.230 (talk) 21:25, 10 June 2011 (UTC)
I believe about 90% of the reason older character sets are not going away is due to people not assuming UTF-8 in their detectors, and that the existence of the UTF-8 BOM is the primary reason these incorrect detectors are being written (since programmers then think they should check for it). Therefore the UTF-8 BOM is the primary reason we still have legacy character sets. Web pages in windows-1252 would display correctly if browsers displayed invalid UTF-8 by translating the individual bytes to the windows-1252 characters, and there would be NO reason for any charset and the default could still be UTF-8. Spitzak (talk) 01:27, 11 June 2011 (UTC)
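The "assume UTF-8 first, fall back to a legacy encoding only if validation fails" approach argued for in this thread can be sketched in a few lines of Python. This is a simplification for illustration only: the function name is hypothetical, windows-1252 as the fallback is an assumption, and the sketch decodes the whole input at once rather than switching encodings partway through a stream.

```python
def decode_guessing_utf8_first(data, fallback="windows-1252"):
    """Decode bytes as UTF-8 if they are valid UTF-8, else as the fallback.

    Returns a (text, encoding_used) pair. The fallback encoding is an
    assumption of this sketch, not something the input can tell us.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" covers the few byte values windows-1252 leaves undefined.
        return data.decode(fallback, errors="replace"), fallback


# Pure ASCII and real UTF-8 are both accepted as UTF-8.
print(decode_guessing_utf8_first(b"plain ASCII"))
print(decode_guessing_utf8_first("caf\u00e9".encode("utf-8")))

# A long ASCII file ending in a windows-1252 trademark sign (byte 0x99)
# fails UTF-8 validation and is decoded with the fallback instead.
text, used = decode_guessing_utf8_first(b"a" * 1000 + b"\x99")
print(used, repr(text[-1]))
```

The last case is the "ASCII for megabytes, then ™" example raised above: a lone 0x99 byte is a continuation byte with no lead byte, so it can never be valid UTF-8.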

Code points are not bytes, but just the opposite

Spitzak, your summary in [2] is wrong. UTF-8 has 8-bit bytes (i.e. the coded message alphabet has ≤ 256 symbols; more exactly, 243), but its code points are just the Unicode ones. Surrogates are code points; they are indeed not characters. Please recall the terminology. The word "character" is inappropriate. Incnis Mrsi (talk) 19:47, 23 March 2010 (UTC)

I believe you are right. I was confusing this with the units used to make the encoding. Spitzak (talk) 01:00, 24 March 2010 (UTC)

Historical: UTF-5 and UTF-6

I propose to delete this section. "UTF-5" and "UTF-6" are unimplemented vaporware; they were early entries in an IDNA competition which Punycode ultimately won. Doug Ewell 20:14, 20 September 2010 (UTC) --Preceding unsigned comment added by DougEwell (talk | contribs)

IMO nothing is wrong with mentioning encodings in obsolete Internet Drafts; after all, Martin Dürst is one of the top ten i18n developers I could name, and he helped to develop UTF-5 before they decided to use Punycode for their IDNA purposes. IIRC I added this section, and refrained from adding my "original research" UTF-4 ;-) –89.204.153.200 (talk) 20:35, 10 June 2011 (UTC)

0x10FFFF limit in UTF-16

Found this in the UTF-16 RFC: "Characters with values greater than 0x10FFFF cannot be encoded in UTF-16." (http://www.ietf.org/rfc/rfc2781.txt) I'm wondering if this is a specific limit of UTF-16? Can characters above 0x10FFFF be encoded in UTF-8, for example? (Are there any such characters?) 86.186.87.85 (talk) 10:08, 30 April 2012 (UTC)

Yes, this is a limit of UTF-16. UTF-8, and initially Unicode itself, were designed to encode up to 0x7FFFFFFF. The Unicode definition was changed to match the limits of UTF-16 by basically saying that any encoding of a value greater than 0x10FFFF, no matter how obvious it is, is invalid. No characters are or will ever be assigned codes above 0x10FFFF (personally I feel this is bogus: people believed 128, 2048, and 65536 were limits to character set sizes, and these were all broken). Spitzak (talk) 18:10, 30 April 2012 (UTC)
The UTF-32 code-point algorithm can encode up to 0xFFFFFFFF, though, making it possible to encode even higher invalid code points. 112.134.149.92 (talk) 06:57, 6 November 2015 (UTC)
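Where the 0x10FFFF limit comes from can be seen from the surrogate-pair arithmetic: each surrogate carries 10 bits, so a pair covers 2^20 values on top of the 0x10000 code points of the BMP. A minimal Python sketch of that arithmetic follows; the function names are made up for illustration.

```python
def to_surrogates(code_point):
    """Split a supplementary code point into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not representable as a surrogate pair")
    value = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (value >> 10)         # top 10 bits
    low = 0xDC00 + (value & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair gives the largest code point UTF-16 can express.
print(hex(from_surrogates(0xDBFF, 0xDFFF)))
print([hex(unit) for unit in to_surrogates(0x1F600)])
```

Since 0xDBFF and 0xDFFF are the last high and low surrogates, nothing above 0x10000 + 0xFFFFF = 0x10FFFF has a representation in this scheme.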

Extended Table

UTF-16 could theoretically be extended to encode code points up to U+FFFFFFFFFF. I'm not sure if the UTF-1 examples are correct, but I replaced 5-byte sequences with 4-byte ones. Modified UTF-16:

0000000000-000000FFFF=        xxxx
0000010000-0000FFFFFF=    D8xxxxxx
0001000000-0001FFFFFF=    D9xxxxxx
0002000000-0002FFFFFF=    DAxxxxxx
0003000000-0003FFFFFF=    DBxxxxxx
0004000000-0004FFFFFF=    DCxxxxxx
0005000000-0005FFFFFF=    DDxxxxxx
0006000000-0006FFFFFF=    DExxxxxx
0007000000-FFFFFFFFFF=DFxxxxxxxxxx

The UTF-24:

00000000-00FFFFFF=      xxxxxx
01000000-FFFFFFFF=00D8xxxxxxxx

Comparison:

                  UTF-8 UTF-16 UTF-32 UTF-EBCDIC UTF-1 UTF-24
00000000-0000007F   1     2      4        1        1     3
00000080-0000009F   2     2      4        1        1     3
000000A0-000003FF   2     2      4        2        2     3
00000400-000007FF   2     2      4        3        2     3
00000800-00003FFF   3     2      4        3        2     3
00004000-00004015   3     2      4        4        2     3
00004016-0000FFFF   3     2      4        4        3     3
00010000-00038E2D   4     4      4        4        3     3
00038E2E-0003FFFF   4     4      4        4        4     3
00040000-001FFFFF   4     4      4        5        4     3
00200000-003FFFFF   5     4      4        5        4     3
00400000-006C3725   5     4      4        6        4     3
006C3726-00FFFFFF   5     4      4        6        5     3
01000000-03FFFFFF   5     4      4        6        5     6
04000000-06FFFFFF   6     4      4        7        5     6
07000000-3FFFFFFF   6     6      4        7        5     6
40000000-4E199F35   6     6      4        8        5     6
4E199F36-FFFFFFFF   6     6      4        8        6     6

164.127.155.17 (talk) 20:40, 6 March 2015 (UTC)

In your "modified UTF-16" you re-use 16-bit codewords, e.g. the 16-bit codeword 002F hex can represent U+002F (the slash, "/") or can be part of a longer sequence. This has a lot of drawbacks, which is why UTF-8 and UTF-16 are designed in such a way that this "double usage" of codewords never happens. --RokerHRO (talk) 08:10, 11 March 2016 (UTC)

I just made this out of "what if" and "for fun" interest... 2A01:119F:251:9000:6048:4CFB:87B3:44FA (talk) 17:51, 22 March 2016 (UTC)

UTF-8 Default?

"Default" implies that "this is what you get if you don't declare the encoding." But the cited section doesn't say that. It seems to say that you should assume UTF-8 if there's no BOM. (I'm not clear what it should assume if there is a BOM.) Even if "UTF-8 is the default" is true in some technical sense, the paragraph would be clearer like this:

All XML processors must support both UTF-8 and UTF-16. If there is no encoding declaration and no byte order mark, the processor can safely assume the file is UTF-8. Thus a plain ASCII file is seen as UTF-8.

Isaac Rabinovitch (talk) 19:27, 16 April 2022 (UTC)

That is exactly what "UTF-8 default" means. If there is no indication as to what the encoding is, use UTF-8 by default. Spitzak (talk) 22:02, 16 April 2022 (UTC)
And if there is a BOM you don't need to "assume" anything, because the encoding can be inferred from how the BOM has been encoded.
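As a rough illustration of "infer the encoding from the BOM, otherwise default to UTF-8", here is a Python sketch using the BOM constants from the standard `codecs` module. The function name is hypothetical, and this deliberately ignores the XML encoding declaration, so it is not a complete XML encoding detector.

```python
import codecs


def encoding_from_bom(data, default="utf-8"):
    """Return a Python codec name inferred from a leading BOM, else the default."""
    # UTF-32 is checked before UTF-16 because the UTF-32-LE BOM (FF FE 00 00)
    # begins with the UTF-16-LE BOM (FF FE).
    candidates = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return default


plain = b"<a/>"
with_bom = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
for raw in (plain, with_bom):
    name = encoding_from_bom(raw)
    print(name, raw.decode(name))
```

The codec names returned here ("utf-8-sig", "utf-16", "utf-32") are the Python codecs that consume the BOM while decoding, so the decoded text does not start with U+FEFF.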

Processing time misconception

A common misconception is that there is a need to "find the nth character" and that this requires a fixed-length encoding; however, in real use the number n is only derived from examining the n−1 characters, thus sequential access is needed anyway.

Could this be clarified, please? Surely with a fixed-length encoding the nth character can be determined to be at offset (n-1)(fixed encoding length) without having to examine the preceding characters? You can do "find the nth character" in a variable-length encoding (so a fixed-length encoding isn't necessary), but it's when you use a variable-length encoding that you do need to scan preceding characters. Or are we not talking about random access (which is what "find the nth character" would suggest is wanted) here?

The point is that there are NO algorithms that need to find the nth character where n is determined without looking at all the characters between that location and one end of the string. If you have to look at all the characters between 0 and n, then the algorithm is the same speed for a variable-length as for a fixed-length encoding (linear in n). Spitzak (talk) 01:51, 12 June 2024 (UTC)
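Both sides of this exchange can be shown in a short Python sketch. It uses 0-based indexing, so the fixed-width offset is 4 × n rather than the (n−1) × width of the 1-based question above, and the function names are hypothetical.

```python
def nth_code_point_offset_utf32(n):
    """Byte offset of the n-th code point (0-based) in UTF-32: pure arithmetic."""
    return 4 * n


def nth_code_point_offset_utf8(data, n):
    """Byte offset of the n-th code point (0-based) in valid UTF-8, found by scanning."""
    count = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a code point starts here
            if count == n:
                return offset
            count += 1
    raise IndexError("fewer than n + 1 code points")


text = "na\u00efve caf\u00e9"
data = text.encode("utf-8")

# Random access by code point index needs a scan in UTF-8 ...
print(nth_code_point_offset_utf8(data, 6))

# ... but a search already yields a byte offset, which can be used directly
# without ever converting it to a "character number".
pos = data.find("caf\u00e9".encode("utf-8"))
print(pos, data[pos:].decode("utf-8"))
```

The second half is the case Spitzak describes: the position came from examining the preceding bytes anyway, so keeping it as a byte offset costs nothing extra in a variable-length encoding.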