C wide string handling: Difference between revisions
Revision as of 01:36, 18 December 2011
C wide string handling refers to a group of functions implementing operations on wide strings in the C Standard Library. Various operations, such as copying, concatenation, tokenization and searching are supported.[1]
The only support in the C programming language itself for wide strings is that the compiler will translate a quoted wide string constant in the source into a null-terminated wide string stored in static memory.
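As a minimal sketch of this compiler support, a wide string constant is written with an L prefix. The names greeting and greeting_elements below are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <wchar.h>

/* A wide string literal is written with an L prefix. The compiler stores
   it as a null-terminated array of wchar_t with static storage duration. */
static const wchar_t greeting[] = L"hello";

/* Hypothetical helper: number of wchar_t elements in the array,
   counting the terminating null wide character. */
size_t greeting_elements(void)
{
    return sizeof greeting / sizeof greeting[0];
}
```

Here greeting_elements() counts six elements (five letters plus the terminator), while wcslen(greeting) counts only the five letters.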
Wide Characters
C is a programming language that was developed in an environment where the dominant character set was the 7-bit ASCII code, and the 8-bit byte has since been the most common unit of encoding. However, software developed for international use has to be able to represent more than 256 different characters; for example, character encoding schemes for the Indian, Chinese and Japanese writing systems should be available. This can only be done by using more than one byte per character. Early multibyte encodings used a variable number of bytes per character, primarily so that ASCII characters could keep using one byte each for compatibility.
Though much blame can be placed on the poor design of these original variable-length encodings, there was also a very long history of indexing characters with integers or pointers that were incremented by 1 without looking at the character, rather than with modern constructs such as iterators, and a desire to "change case" by substituting exactly one character with exactly one other character (this does not work in Unicode except for some Latin-script languages, including English). These habits made it difficult to adapt existing software and practices to variable-length encodings.
The inconvenience of handling variable-length multibyte characters can be eliminated by using characters that all occupy the same number of bytes, which makes the string an array of a larger data type. ANSI C provides a type that allows storage of characters as uniformly sized data objects called wide characters.[2]
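The following sketch illustrates the point about uniform size: because every element of a wide string is one wchar_t, the i-th character can be reached by plain array indexing. The helper name wide_char_at is hypothetical, and the sketch assumes each character of interest fits in a single wchar_t on the platform in use.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: returns the i-th character of a wide string,
   or the null wide character if i is past the end of the string. */
wchar_t wide_char_at(const wchar_t *s, size_t i)
{
    return i < wcslen(s) ? s[i] : L'\0';
}
```

For example, wide_char_at(L"na\u00efve", 2) yields the single element holding U+00EF, whereas in a multibyte encoding such as UTF-8 the same character would occupy two bytes and shift the index of every character after it.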
Problems
Windows and early Unix systems defined wchar_t as a 16-bit unit. Linux and BSD libraries define it as a 32-bit unit in order to handle Unicode correctly (Unicode has more than 65,536 characters). This makes data files and communication that use wchar_t incompatible between platforms. Generally, the only solution when portable software is desired is to reimplement the wide string functions for one fixed character size so that all implementations match.
The 16-bit size is now almost always used for UTF-16, which is a variable-length encoding, thus removing the advantage of using wchar_t over byte strings.
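To see why UTF-16 is a variable-length encoding, the following sketch encodes a Unicode code point into 16-bit units. The function name utf16_encode is hypothetical, and the sketch assumes the input is a valid code point (at most 0x10FFFF and not itself a surrogate); it uses uint16_t rather than wchar_t so that it behaves the same on every platform.

```c
#include <stdint.h>

/* Hypothetical helper: encodes code point cp as UTF-16 into out and
   returns the number of 16-bit units written (1 or 2).
   Assumes cp is a valid code point; no validation is performed. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

A code point such as U+0041 takes one unit, while U+1F600 takes the two units 0xD83D and 0xDE00, so indexing a 16-bit wide string by element no longer corresponds to indexing by character.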
Declarations and Definitions
Macros
The standard header wchar.h contains the definitions or declarations of some constants.
- NULL
- It is a null pointer constant. It never points to a real object.
- WCHAR_MIN
- It indicates the lower limit or the minimum value for the type wchar_t.
- WCHAR_MAX
- It indicates the upper limit or the maximum value for the type wchar_t.
- WEOF
- It is a value of type wint_t that does not correspond to any member of the extended character set. Wide character functions return WEOF to indicate the end of a character stream (end of file, EOF) or an error.[3]
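The macros above might be used as in the following sketch. The helper names fits_in_wchar and count_wide_chars are hypothetical; the second one assumes the stream is readable and has not been given byte orientation.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if v lies within the range of wchar_t. */
int fits_in_wchar(long long v)
{
    return v >= (long long)WCHAR_MIN && v <= (long long)WCHAR_MAX;
}

/* Hypothetical helper: counts the wide characters left in a stream.
   fgetwc returns WEOF at end of file or on error, and WEOF never
   compares equal to a real member of the extended character set. */
size_t count_wide_chars(FILE *fp)
{
    size_t n = 0;
    while (fgetwc(fp) != WEOF)
        n++;
    return n;
}
```

The result of fgetwc is compared against WEOF before it is treated as a character, which is why such functions return wint_t rather than wchar_t.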
Data Types
- mbstate_t
- An object of type mbstate_t holds the conversion state that must be carried from one call of a multibyte conversion function to the next.
- size_t
- It is an unsigned integer type used for sizes and counts; it is the type of the result of the sizeof operator.
- wchar_t
- An object of type wchar_t can hold a wide character. It is also required for declaring or referencing wide characters and wide strings.
- wint_t
- This type is an integer type that can hold any value corresponding to the members of the extended character set. It can hold all values of the type wchar_t as well as the value of the macro WEOF. This type is unchanged by default argument promotions.
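As a sketch of how these types work together, the following hypothetical helper to_wide converts a multibyte string to a wide string with the standard function mbrtowc, passing an mbstate_t object from call to call. It assumes the input is valid in the current locale's multibyte encoding; in the default "C" locale, plain ASCII input is converted unchanged.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts the multibyte string src into dst, which
   has room for cap elements including the terminator. Returns the number
   of wide characters stored, or (size_t)-1 on a conversion error. */
size_t to_wide(const char *src, wchar_t *dst, size_t cap)
{
    mbstate_t state;
    size_t remaining = strlen(src);
    size_t n = 0;

    memset(&state, 0, sizeof state);    /* initial conversion state */
    while (remaining > 0 && n + 1 < cap) {
        wchar_t wc;
        size_t used = mbrtowc(&wc, src, remaining, &state);
        if (used == (size_t)-1 || used == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (used == 0)
            break;                      /* null character reached */
        dst[n++] = wc;
        src += used;
        remaining -= used;
    }
    if (cap > 0)
        dst[n] = L'\0';
    return n;
}
```

The counts and lengths are size_t values, each converted character is a wchar_t, and the mbstate_t object lets mbrtowc resume correctly when a character spans several bytes.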
Functions
- Wide string manipulation
wcscpy
- copies one wide string to another
wcsncpy
- writes exactly n characters to a wide string, copying from given string or adding nulls
wcscat
- appends one wide string to another
wcsncat
- appends no more than n characters from one wide string to another
wcsxfrm
- transforms a wide string according to the current locale
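The copying and appending functions above could be combined as in this sketch. The helper join_words is hypothetical; it truncates its result rather than overflow the destination buffer.

```c
#include <wchar.h>

/* Hypothetical helper: writes "first second" into dst, which has room for
   cap elements including the terminator. The result is truncated if needed. */
void join_words(wchar_t *dst, size_t cap,
                const wchar_t *first, const wchar_t *second)
{
    if (cap == 0)
        return;
    wcsncpy(dst, first, cap - 1);
    dst[cap - 1] = L'\0';   /* wcsncpy adds no terminator if first fills dst */
    /* wcsncat appends at most n characters and then a terminator */
    wcsncat(dst, L" ", cap - 1 - wcslen(dst));
    wcsncat(dst, second, cap - 1 - wcslen(dst));
}
```

With a 16-element buffer, join_words(buf, 16, L"wide", L"string") leaves L"wide string" in buf. The bounded forms wcsncpy and wcsncat are used because wcscpy and wcscat do not check the size of the destination.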
- Wide string examination
wcslen
- returns the length of a wide string
wcscmp
- compares two wide strings
wcsncmp
- compares a specific number of characters in two wide strings
wcscoll
- compares two wide strings according to the current locale
wcschr
- finds the first occurrence of a character in a wide string
wcsrchr
- finds the last occurrence of a character in a wide string
wcsspn
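- returns the length of the initial part of a wide string that consists only of characters in a given set (the index of the first character not in the set)
wcscspn
- returns the length of the initial part of a wide string that consists only of characters not in a given set (the index of the first character in the set)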
wcspbrk
- finds in a wide string the first occurrence of a character in a set of characters
wcsstr
- finds in a wide string the first occurrence of a substring
wcstok
- finds in a wide string the next occurrence of a token
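A few of the examination functions are sketched below. The helpers leading_digits and count_tokens are hypothetical, and the sketch uses the three-argument form of wcstok defined by the C standard; some older libraries provide a two-argument variant instead.

```c
#include <wchar.h>

/* Hypothetical helper: number of decimal digits at the start of s. */
size_t leading_digits(const wchar_t *s)
{
    return wcsspn(s, L"0123456789");
}

/* Hypothetical helper: counts the tokens in text that are separated by
   any of the characters in delims. wcstok modifies the string it scans,
   so a local copy is made; input of 128 or more characters is rejected. */
size_t count_tokens(const wchar_t *text, const wchar_t *delims)
{
    wchar_t copy[128];
    wchar_t *state;
    size_t n = 0;

    if (wcslen(text) >= 128)
        return 0;
    wcscpy(copy, text);
    for (wchar_t *tok = wcstok(copy, delims, &state);
         tok != NULL;
         tok = wcstok(NULL, delims, &state))
        n++;
    return n;
}
```

For instance, leading_digits(L"123abc") is 3, and count_tokens(L"a,b;;c", L",;") is 3 because wcstok treats a run of consecutive delimiters as a single separator.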
- Memory manipulation
Conversion Functions
Name | Notes |
---|---|
wint_t btowc(int c); | converts the single-byte character c into its wide character equivalent and returns the result; returns WEOF if c is EOF or is not a valid single-byte character. |
int wctob(wint_t c); | returns the single-byte equivalent of the wide character c; returns EOF if c has no single-byte representation. |
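The two conversion functions can be sketched as a round trip. The helper round_trips is hypothetical; which bytes convert successfully depends on the current locale, although the characters of the basic character set, such as 'A', are valid single-byte characters in every locale.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the byte c converts to a wide
   character and back to the same byte in the current locale. */
int round_trips(unsigned char c)
{
    wint_t w = btowc(c);
    if (w == WEOF)
        return 0;               /* c is not a valid single-byte character */
    return wctob(w) == (int)c;
}
```

Note the asymmetry in the failure values: btowc reports failure with WEOF, a wint_t value, while wctob reports failure with EOF, an int value.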
References
- ^ ISO/IEC 9899:1999 specification (PDF). p. 371, § 7.24.4 "General wide string utilities". Retrieved 30 November 2011.
- ^ http://books.google.co.in/books?id=4Mfe4sAMFUYC&pg=PT26&lpg=PT26&dq=wide+characters+as+superset&source=bl&ots=tPLP1nN4qh&sig=f2W0ys85Ms9lRdT4HBEf_yoNL2U&hl=en&ei=H3SMTpiJFpGzrAfzvOycAg&sa=X&oi=book_result&ct=result&resnum=2&sqi=2&ved=0CCIQ6AEwAQ#v=onepage&q&f=false
- ^ http://publib.boulder.ibm.com/infocenter/zos/v1r12/index.jsp?topic=%2Fcom.ibm.zos.r12.bpxbd00%2Fwcharh.htm