Wide character
Revision as of 09:18, 18 February 2015
This article needs additional citations for verification. (February 2011)
A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets.
History
During the 1960s, mainframe and mini-computer manufacturers began to standardize around the 8-bit byte as their smallest datatype. The 7-bit ASCII character set became the industry standard method for encoding alphanumeric characters for teletype machines and computer terminals. The extra bit was used for parity, to ensure the integrity of data storage and transmission. As a result, the 8-bit byte became the de facto datatype for computer systems storing ASCII characters in memory.
Later, computer manufacturers began to make use of the spare bit to extend the ASCII character set beyond its limited set of English alphabet characters. 8-bit extensions such as IBM code page 37, PETSCII and ISO 8859 became commonplace, offering terminal support for Greek, Cyrillic, and many others. However, such extensions were still limited in that they were region specific and often could not be used in tandem. Special conversion routines had to be used to convert from one character set to another, often resulting in destructive translation when no equivalent character existed in the target set.
In 1989, the International Organization for Standardization began work on the Universal Character Set (UCS), a multilingual character set that could be encoded using either a 16-bit (2-byte) or 32-bit (4-byte) value. These larger values required a datatype larger than 8 bits to store the new character values in memory. Thus the term wide character was used to differentiate them from traditional 8-bit character datatypes.
Relation to UCS and Unicode
The term wide character refers to the size of the datatype in memory. It does not state how each value in a character set is defined. Those values are instead defined by character sets, with UCS and Unicode simply being two common character sets that contain more characters than an 8-bit value can represent.
Relation to multibyte characters
Just as earlier data transmission systems suffered from the lack of an 8-bit clean data path, modern transmission systems often lack support for 16-bit or 32-bit data paths for character data. This has led to character encoding systems such as UTF-8 that can use multiple bytes to encode a value that is too large for a single 8-bit symbol.
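As an illustration of how an encoding such as UTF-8 spreads one large value across several 8-bit units, the following is a minimal sketch in C. The function name utf8_encode is a hypothetical helper chosen for this example, not part of any standard library, and the sketch assumes the caller supplies a 4-byte output buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a standard function): encode one Unicode
 * code point as UTF-8. Writes 1 to 4 bytes into out and returns the
 * number of bytes written, or 0 if cp is a surrogate or lies above
 * U+10FFFF and therefore cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 7 bits: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 16 bits: three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, the euro sign U+20AC does not fit in a single 8-bit symbol and is emitted as the three bytes E2 82 AC.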
The C standard distinguishes between multibyte encodings of characters, which use a fixed or variable number of bytes to represent each character (primarily used in source code and external files), and wide characters, which are run-time representations of characters in single objects (typically larger than 8 bits).
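The distinction can be sketched with the standard C function mbstowcs, which converts a multibyte string into an array of wide characters. The wrapper name to_wide and its buffer-handling policy are assumptions made for this example. The multibyte encoding is whatever the current locale specifies.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper around the standard mbstowcs(): convert the
 * multibyte string mb (in the current locale's encoding) into wide
 * characters stored in dst, which has room for dst_len elements.
 * The result is always null-terminated. Returns the number of wide
 * characters written, excluding the terminator, or (size_t)-1 if
 * dst_len is 0 or mb contains an invalid multibyte sequence. */
size_t to_wide(const char *mb, wchar_t *dst, size_t dst_len)
{
    if (dst_len == 0)
        return (size_t)-1;

    /* Reserve one element for the terminator. */
    size_t n = mbstowcs(dst, mb, dst_len - 1);
    if (n == (size_t)-1)
        return n;

    dst[n] = L'\0';
    return n;
}
```

In the default "C" locale only plain ASCII input is guaranteed to convert. A program that handles other text would normally call setlocale(LC_ALL, "") first so that the user's multibyte encoding is in effect.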
Size of a wide character
UTF-16 little-endian is the encoding standard at Microsoft (and in the Windows operating system). Through surrogate pairs, which combine two 16-bit code units, it can also represent characters outside the 16-bit range.[1] The .NET Framework supports multiple Unicode encodings, including UTF-7, UTF-8, UTF-16 and UTF-32.[2]
The Java platform requires that wide character variables be defined as 16-bit values, and that characters be encoded using UTF-16 (due to former use of UCS-2), while modern Unix-like systems generally require 32-bit values encoded using UTF-32.[citation needed]
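The surrogate-pair mechanism mentioned above can be sketched in a few lines of C. The function name utf16_encode is hypothetical and chosen only for this illustration. The arithmetic follows the UTF-16 definition: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```c
#include <stdint.h>

/* Hypothetical helper (not a standard function): encode one Unicode
 * code point as UTF-16 code units. Returns the number of 16-bit units
 * written to out (1 or 2), or 0 if cp is a surrogate value or lies
 * above U+10FFFF. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {            /* fits in a single 16-bit unit */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+1F600 is encoded as the pair D83D DE00, so a 16-bit wide character type needs two elements to hold it.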
Programming specifics
C/C++
The C and C++ standard libraries include a number of facilities for dealing with wide characters and strings composed of them. The wide characters are defined using datatype wchar_t,[3] which in the original C90 standard was defined as
- "an integral type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales" (ISO 9899:1990 §4.1.5)
Both C and C++ introduced fixed-size character types char16_t and char32_t in the 2011 revisions of their respective standards to provide unambiguous representation of 16-bit and 32-bit Unicode transformation formats, leaving wchar_t implementation-defined. The ISO/IEC 10646:2003 Unicode standard 4.0 says that:
- "The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."
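Because the width of wchar_t varies between implementations, a program can inspect it rather than assume it. The following is a minimal sketch using only standard headers. The function names wchar_bits and wchar_holds_all_unicode are hypothetical and introduced for this example.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* WCHAR_MAX */
#include <wchar.h>    /* wchar_t */

/* Hypothetical helper: number of bits occupied by wchar_t on this
 * implementation (commonly 16 or 32, but not guaranteed). */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: nonzero if a single wchar_t can hold any
 * Unicode code point, i.e. values up to U+10FFFF. */
int wchar_holds_all_unicode(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

Code that needs a known width regardless of the compiler can use char16_t or char32_t instead, as described above.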
Python
According to Python's documentation, the language sometimes uses wchar_t as the basis for its character type Py_UNICODE. It depends on whether wchar_t is "compatible with the chosen Python Unicode build variant" on that system.[4]
References
- [1] http://msdn.microsoft.com/en-us/goglobal/bb688113.aspx
- [2] https://msdn.microsoft.com/en-us/library/System.Text.aspx
- [3] "C++ Wide character wchar_t". TutorialCup.com. https://www.tutorialcup.com/cplusplus/strings.htm#wchar_t. Retrieved 17 February 2015.
- [4] https://docs.python.org/c-api/unicode.html. Retrieved 19 December 2009.
External links
- The Unicode Standard, Version 4.0 - online edition
- C Wide Character Functions @ Java2S
- Java Unicode Functions @ Java2S
- Multibyte (3) Man Page @ FreeBSD.org
- Multibyte and Wide Characters @ Microsoft Developer Network
- Windows Character Sets @ Microsoft Developer Network
- Unicode and Character Set Programming Reference @ Microsoft Developer Network
- STR33-C. Size wide character strings correctly @ Secure Coding