Wide character
Wide character is a computer programming term. It is a vague term used to represent a datatype that is richer than the traditional 8-bit characters. It is not the same thing as Unicode. On systems where a wide character is 2 bytes (16 bits) wide, a variable of this type stores a character code with a value in the range from 0 to 65,535.
wchar_t is a data type in ANSI/ISO C, ANSI/ISO C++, and some other programming languages that is intended to represent wide characters.
The Unicode standard 4.0 says that
- "ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension."
and that
- "The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."
Under Win32, wchar_t is 16 bits wide and represents a UTF-16 code unit. On Unix-like systems wchar_t is commonly 32 bits wide and represents a UTF-32 code unit.
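Because the width differs between platforms, it can help to ask the compiler directly. The following is a minimal sketch in C; the helper names wchar_bits and print_wchar_width are made up for this example and are not part of any library.

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* wchar_t */
#include <stdio.h>   /* printf */

/* Width of wchar_t in bits for the compiler in use (hypothetical helper). */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Prints the width so it can be compared across platforms. */
void print_wchar_width(void)
{
    printf("wchar_t is %u bits wide\n", wchar_bits());
}
```

Calling print_wchar_width() would typically report 16 on Win32 and 32 on many Unix-like systems, but the result depends on the compiler.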
In the ANSI C library, the header files <wchar.h> and <wctype.h> deal with wide characters.
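As a small illustration of what these two headers provide, the sketch below upper-cases a wide string in place. It uses towupper() from <wctype.h>; the function name wide_upper is an assumption made for this example. In the default "C" locale only the basic letters are sure to be converted.

```c
#include <wchar.h>   /* wchar_t, wint_t */
#include <wctype.h>  /* towupper */

/* Upper-cases a null-terminated wide string in place (hypothetical helper). */
void wide_upper(wchar_t *s)
{
    for (; *s != L'\0'; ++s) {
        *s = (wchar_t)towupper((wint_t)*s);
    }
}
```

The string and comparison functions in <wchar.h>, such as wcslen() and wcscmp(), mirror strlen() and strcmp() for narrow strings.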
Python
According to the Python documentation, Python sometimes uses wchar_t as the basis for its character type Py_UNICODE. It depends on whether wchar_t is "compatible with the chosen Python Unicode build variant" on that system. [1]
Syntax
wchar_t variable_name = L'c'; // C and C++: stores one wide character code; the value is a wide character literal (note the L prefix) or a numeric code such as 0x0041
wchar_t variable_name(L'c'); // C++ only: the same initialization written with parentheses
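The two ways of giving a value can be compared in a short C sketch. The function name wide_matches_narrow is invented for this example, and the hexadecimal code 0x0041 for the letter A assumes an ASCII-compatible wide character encoding, which is common but not required by the C standard.

```c
#include <stddef.h>  /* wchar_t */

/* Returns 1 if the wide literal, the narrow literal, and the numeric
   code all name the same character (hypothetical helper). */
int wide_matches_narrow(void)
{
    wchar_t letter = L'A';     /* wide character literal: note the L prefix */
    wchar_t by_code = 0x0041;  /* same letter as a hexadecimal code;
                                  assumes an ASCII-compatible encoding */

    return letter == 'A' && by_code == letter;
}
```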
Functions
There are several functions in C's stdlib.h to help with wchar_t values.
- wctomb() - wide character to multibyte character [2]
- mbtowc() - multibyte character to wide char [3]
- wcstombs() - wide-char string to multibyte character string [4]
- mbstowcs() - multibyte character string to wide-char string [5]
- mblen() - number of bytes in a multibyte character [6]
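The string functions in this list can be combined into a round trip from a multibyte string to wide characters and back. This is only a sketch: the function name round_trip and the fixed buffer of 64 wide characters are assumptions for this example. In the default "C" locale only the basic characters are sure to convert; a program would normally call setlocale() first to use the user's encoding.

```c
#include <stdlib.h>  /* mbstowcs, wcstombs, wchar_t, size_t */

/* Converts src to wide characters and back into out (hypothetical helper).
   Returns the number of bytes written, not counting the terminator,
   or -1 if the text is invalid or does not fit. */
int round_trip(const char *src, char *out, size_t out_len)
{
    wchar_t wide[64];
    size_t n;

    n = mbstowcs(wide, src, 63);       /* leave room for the terminator */
    if (n == (size_t)-1) {
        return -1;                     /* invalid multibyte sequence */
    }
    wide[n] = L'\0';                   /* terminate even if input was cut at 63 */

    n = wcstombs(out, wide, out_len);
    if (n == (size_t)-1 || n == out_len) {
        return -1;                     /* not representable, or no room for '\0' */
    }
    return (int)n;
}
```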
The author of GNU libc advises avoiding these because of the 'state' mechanism they involve, and instead suggests the 'restartable' functions such as mbsrtowcs. [7]
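A restartable function keeps its conversion state in an mbstate_t object supplied by the caller instead of in hidden internal storage. The sketch below shows one possible way to wrap mbsrtowcs(); the name to_wide and its buffer handling are assumptions for this example, not code from GNU libc.

```c
#include <string.h>  /* memset */
#include <wchar.h>   /* mbsrtowcs, mbstate_t, wchar_t */

/* Converts the multibyte string src into dst, which has room for dst_len
   wide characters including the terminator (hypothetical helper).
   Returns the number of wide characters written, not counting the
   terminator, or (size_t)-1 on an invalid sequence or an empty buffer. */
size_t to_wide(wchar_t *dst, size_t dst_len, const char *src)
{
    mbstate_t state;
    size_t n;

    if (dst_len == 0) {
        return (size_t)-1;
    }
    memset(&state, 0, sizeof state);   /* start in the initial conversion state */

    n = mbsrtowcs(dst, &src, dst_len - 1, &state);
    if (n == (size_t)-1) {
        return n;                      /* invalid multibyte sequence */
    }
    dst[n] = L'\0';                    /* always terminate, even when truncated */
    return n;
}
```

Because the state lives in a local variable, two calls cannot interfere with each other, which is the point of the restartable family.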
Notes
- http://docs.python.org/c-api/unicode.html, accessed 2009-12-19
- C++ Resources Network - wctomb, accessed 2009-12-15
- C++ Resources Network - mbtowc, accessed 2009-12-15
- C++ Resources Network - wcstombs, accessed 2009-12-15
- C++ Resources Network - mbstowcs, accessed 2009-12-15
- C++ Resources Network - mblen, accessed 2009-12-15
- GNU.org libc source code, libc/stdlib/mbstowcs.c, accessed 2009-12-15