Wide character: Difference between revisions

From Wikipedia, the free encyclopedia

Latest revision as of 17:06, 9 September 2023

A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets.

History

During the 1960s, mainframe and minicomputer manufacturers began to standardize around the 8-bit byte as their smallest datatype. The 7-bit ASCII character set became the industry-standard method for encoding alphanumeric characters for teletype machines and computer terminals. The extra bit was used for parity, to ensure the integrity of data storage and transmission. As a result, the 8-bit byte became the de facto datatype for computer systems storing ASCII characters in memory.

Later, computer manufacturers began to use the spare bit to extend the ASCII character set beyond its limited set of English alphabet characters. 8-bit extensions such as IBM code page 37, PETSCII and ISO 8859 became commonplace, offering terminal support for Greek, Cyrillic, and many other scripts. However, such extensions were still limited in that they were region-specific and often could not be used in tandem. Special conversion routines had to be used to convert from one character set to another, often resulting in destructive translation when no equivalent character existed in the target set.

In 1989, the International Organization for Standardization began work on the Universal Character Set (UCS), a multilingual character set that could be encoded using either a 16-bit (2-byte) or 32-bit (4-byte) value. These larger values required a datatype larger than 8 bits to store the new character values in memory. Thus the term "wide character" came to be used to distinguish such datatypes from traditional 8-bit character datatypes.

Relation to UCS and Unicode

The term wide character refers to the size of the datatype in memory. It does not state how each value in a character set is defined. Those values are instead defined by character sets, with UCS and Unicode simply being two common character sets that encode more characters than an 8-bit numeric value (256 distinct values) would allow.

Relation to multibyte characters

Just as earlier data transmission systems suffered from the lack of an 8-bit clean data path, modern transmission systems often lack support for 16-bit or 32-bit data paths for character data. This has led to character encoding systems such as UTF-8 that can use multiple bytes to encode a value that is too large for a single 8-bit symbol.
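
The way a multibyte encoding spreads one large value over several 8-bit units can be sketched in a few lines. The following is an illustrative sketch only: the helper name utf8_encode is hypothetical (it is not part of any library), and the result is cross-checked against Python's built-in UTF-8 codec.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 (hypothetical helper, for illustration)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:          # up to 7 bits: one byte, identical to ASCII
        return bytes([code_point])
    if code_point < 0x800:         # up to 11 bits: two bytes
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # up to 16 bits: three bytes
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),          # up to 21 bits: four bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# U+0041, U+00E9, U+20AC and U+1F600 need 1, 2, 3 and 4 bytes respectively.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # cross-check against the built-in codec
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

Every byte produced for a value above 0x7F has its high bit set, so such sequences never collide with ASCII bytes on an 8-bit data path.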

The C standard distinguishes multibyte encodings of characters, which use a fixed or variable number of bytes to represent each character (primarily used in source code and external files), from wide characters, which are run-time representations of characters in single objects (typically greater than 8 bits).
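
The difference between the two representations can be observed from Python. This is a rough sketch, assuming CPython's ctypes module, whose create_unicode_buffer function allocates a NUL-terminated array of the platform's C wchar_t; the variable names are illustrative, and UTF-8 is assumed as the multibyte encoding.

```python
import ctypes

text = "na\u00efve"    # five characters, one of them outside ASCII

# Multibyte form: a byte sequence in which one character may occupy several bytes.
multibyte = text.encode("utf-8")

# Wide form: an array of wchar_t objects, one per character here,
# plus a terminating NUL element.
wide = ctypes.create_unicode_buffer(text)

print(len(text), "characters")
print(len(multibyte), "bytes in the multibyte (UTF-8) form")
print(len(wide), "wchar_t elements,", ctypes.sizeof(wide), "bytes in the wide form")
```

The byte count of the wide form depends on the platform's wchar_t width, which is why no fixed number is shown for it.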

Size of a wide character

Early adoption of UCS-2 ("Unicode 1.0") led to common use of UTF-16 on a number of platforms, most notably Microsoft Windows, .NET and Java. On these systems, it is common to have a 16-bit "wide character" type (wchar_t in C/C++; char in Java). These types do not always map directly to one "character", because surrogate pairs have been required to store the full range of Unicode since Unicode 2.0 (1996).[1][2][3]
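
The surrogate mechanism can be illustrated with a short sketch. The helper name to_utf16_units is hypothetical; the arithmetic follows the UTF-16 definition, and the result is cross-checked against Python's built-in UTF-16 codec.

```python
def to_utf16_units(code_point):
    """Split one code point into 16-bit UTF-16 code units (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]                 # fits in a single 16-bit unit
    offset = code_point - 0x10000           # 20 bits remain
    high = 0xD800 | (offset >> 10)          # top 10 bits -> high surrogate
    low = 0xDC00 | (offset & 0x3FF)         # bottom 10 bits -> low surrogate
    return [high, low]


for ch in ["\u20ac", "\U0001F600"]:
    units = to_utf16_units(ord(ch))
    # Cross-check: big-endian UTF-16 bytes of the same character.
    expected = b"".join(u.to_bytes(2, "big") for u in units)
    assert expected == ch.encode("utf-16-be")
    print(f"U+{ord(ch):04X} -> {[f'{u:04X}' for u in units]}")
```

A 16-bit wide character type therefore holds U+20AC in one unit but needs two units for U+1F600, which is the case that code counting "characters" by units gets wrong.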

Unix-like systems generally use a 32-bit wchar_t so that a single value can hold any 21-bit Unicode code point, as C90 prescribed.[4]
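
The width of wchar_t on a given platform can be checked directly. The following is a minimal sketch, assuming CPython's ctypes module, whose c_wchar type corresponds to the C wchar_t of the platform; the variable names are illustrative, and 8-bit bytes are assumed.

```python
import ctypes

# ctypes.c_wchar mirrors the platform's C wchar_t.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)
wchar_bits = 8 * wchar_bytes    # assumes 8-bit bytes

# The largest code point, U+10FFFF, needs 21 bits.
bits_needed = (0x10FFFF).bit_length()
fits_any_code_point = wchar_bits >= bits_needed

print(f"wchar_t is {wchar_bits} bits wide; a code point needs up to {bits_needed} bits")
if fits_any_code_point:
    print("one wchar_t can hold any Unicode code point")
else:
    print("code points above U+FFFF need a surrogate pair")
```

The output differs between platforms, typically 16 bits on Windows and 32 bits on Unix-like systems.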

The size of a wide character type does not dictate what kinds of text encoding a system can process, as conversions are available. (Old conversion code commonly overlooks surrogates, however.) The historical circumstances of a system's adoption of Unicode do, however, influence which encodings it prefers. A system influenced by Unicode 1.0, such as Windows, tends to mainly use "wide strings" made of wide character units. Other systems, such as the Unix-likes, tend to retain the 8-bit "narrow string" convention, using a multibyte encoding (almost universally UTF-8) to handle "wide" characters.[5]

Programming specifics

C/C++

The C and C++ standard libraries include a number of facilities for dealing with wide characters and strings composed of them. Wide characters are represented by the datatype wchar_t, which the original C90 standard defined as:

"an integral type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales" (ISO 9899:1990 §4.1.5)

Both C and C++ introduced fixed-size character types char16_t and char32_t in the 2011 revisions of their respective standards to provide unambiguous representation of 16-bit and 32-bit Unicode transformation formats, leaving wchar_t implementation-defined. The ISO/IEC 10646:2003 Unicode standard 4.0 says that:

"The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."[6]

Python

According to Python 2.7's documentation, the language sometimes uses wchar_t as the basis for its character type Py_UNICODE. It depends on whether wchar_t is "compatible with the chosen Python Unicode build variant" on that system.[7] This distinction has been deprecated since Python 3.3, which introduced flexibly sized UCS1/2/4 storage for strings and formally aliased Py_UNICODE to wchar_t.[8] Since Python 3.12, the use of wchar_t (that is, the Py_UNICODE typedef) for Python strings, known as wstr in the implementation, has been dropped; as before, a "UTF-8 representation is created on demand and cached in the Unicode object."[9]
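
The flexibly sized storage can be observed indirectly. This is a rough sketch that assumes CPython 3.3 or later, where sys.getsizeof reports the memory held by a string object; the helper name bytes_per_char is hypothetical, and other Python implementations may report different figures.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character from the size difference of two strings."""
    # The fixed object header cancels out in the subtraction.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


samples = [
    ("Latin-1 range (UCS1)", "\u00e9"),
    ("Basic Multilingual Plane (UCS2)", "\u20ac"),
    ("beyond U+FFFF (UCS4)", "\U0001F600"),
]
for label, ch in samples:
    print(f"{label}: {bytes_per_char(ch)} byte(s) per character")
```

The storage width is chosen per string from the widest character it contains, independently of the platform's wchar_t.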

References

  1. "Globalization Step-by-Step: Unicode Enabled". msdn.microsoft.com. Archived from the original on 1 January 2009.
  2. "String Class (System)". learn.microsoft.com.
  3. "Primitive Data Types (The Java™ Tutorials > Learning the Java Language > Language Basics)". docs.oracle.com.
  4. "Null-terminated wide strings <wctype.h> - cppreference.com". en.cppreference.com.
  5. "UTF-8 Everywhere". In the following years many systems have added support for Unicode and switched to the UCS-2 encoding. It was especially attractive for new technologies, such as the Qt framework (1992), Windows NT 3.1 (1993) and Java (1995).
  6. "5.2 ANSI/ISO C wchar_t". The Unicode standard. Aliprand, Joan., Unicode Consortium. (Version 4.0 ed.). Boston: Addison-Wesley. 2003. p. 109. ISBN 0-321-18578-1. OCLC 52257637.
  7. "Unicode Objects and Codecs — Python 2.7 documentation". docs.python.org. Retrieved 19 December 2009.
  8. "Unicode Objects and Codecs — Python 3.10.10 documentation". docs.python.org. Retrieved 18 February 2023.
  9. "Unicode Objects and Codecs". Python documentation. Retrieved 9 September 2023.