
C wide string handling: Difference between revisions

From Wikipedia, the free encyclopedia
Yobot (talk | contribs)
m WP:CHECKWIKI error 54|52|37|88|6|57|44|26|38|55|63|65|53|51|45|64|76|78|80|46|10 fixes + general fixes, removed stub tag using AWB (7879)

Revision as of 01:36, 18 December 2011

C wide string handling refers to a group of functions in the C Standard Library that implement operations on wide strings. Various operations, such as copying, concatenation, tokenization and searching, are supported.[1]

The only support in the C programming language itself for wide strings is that the compiler will translate a quoted wide string constant in the source into a null-terminated wide string stored in static memory.
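As a minimal sketch of this language-level support, the fragment below declares a wide string literal with the L prefix and measures it. The names greeting and greeting_length are illustrative and do not come from the standard.

```c
#include <stddef.h>
#include <wchar.h>

/* A wide string literal is written with the L prefix. The compiler stores it
   as a static array of wchar_t that ends with a null wide character. */
static const wchar_t greeting[] = L"hello";

/* Number of wide characters in the literal, not counting the terminator.
   (greeting_length is a hypothetical helper for this example.) */
size_t greeting_length(void)
{
    return sizeof greeting / sizeof greeting[0] - 1;
}
```

The element count is obtained by dividing by sizeof greeting[0] because sizeof reports bytes, and a wchar_t usually occupies more than one byte.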

Wide Characters

C is a programming language that was developed in an environment where the dominant character set was the 7-bit ASCII code, and the 8-bit byte has since remained the most common unit of encoding. Software developed for international use, however, has to represent more than 256 different characters; for example, it needs character encodings for the Indian, Chinese and Japanese writing systems. This can only be done by using more than one byte per character. Early encodings used a variable number of bytes per character, primarily so that ASCII characters could continue to occupy one byte each for compatibility.

Although the design of these early variable-width encodings shares the blame, established programming practice also made them hard to adopt. There was a long history of indexing characters with integers or pointers that were incremented by 1 without inspecting the character, rather than with constructs such as iterators, and of changing case by substituting exactly one character for exactly one other character (this does not work in general in Unicode, although it does for some Latin-script languages, including English). This made it difficult to adapt existing software and practices to variable-width encodings.

The inconvenience of handling variable-width multibyte characters can be avoided by giving every character the same number of bytes, which makes the string an array of a larger data type. ANSI C provides such a type, wchar_t, which stores characters as uniformly sized data objects called wide characters.[2]
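The following sketch illustrates why a uniform element size is convenient: a wide string can be processed by plain array indexing. The function wide_reverse is a hypothetical helper, and it assumes that each wchar_t element holds one complete character.

```c
#include <stddef.h>
#include <wchar.h>

/* Reverse a wide string in place by swapping elements from both ends.
   Assumption: every element is a whole character. This does not hold for
   surrogate pairs on platforms with a 16-bit wchar_t, nor for combining
   character sequences. */
void wide_reverse(wchar_t *s)
{
    size_t n = wcslen(s);
    for (size_t i = 0; i < n / 2; i++) {
        wchar_t tmp = s[i];
        s[i] = s[n - 1 - i];
        s[n - 1 - i] = tmp;
    }
}
```

Applied to a writable buffer initialised from L"abc", the function leaves L"cba" in the buffer. The same loop over a multibyte char string could split a character in the middle.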

Problems

Windows and early Unix systems defined wchar_t as a 16-bit unit. Linux and BSD libraries define it as a 32-bit unit in order to handle Unicode correctly (Unicode has more than 65,536 code points). As a result, data files and communication protocols that use wchar_t are incompatible between platforms. If portable software is desired, generally the only solution is to reimplement these functions for a fixed character size so that all implementations match.
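A program can check which of these conventions its platform follows. The sketch below uses only sizeof, CHAR_BIT and WCHAR_MAX; the function names are illustrative.

```c
#include <limits.h>
#include <wchar.h>

/* Width of wchar_t in bits on the platform this is compiled for. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Nonzero if a single wchar_t can hold every Unicode code point
   (the highest code point is U+10FFFF). */
int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

The results depend on the platform, so code that exchanges wchar_t data with other systems cannot rely on either value.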

The 16-bit size is now almost always used for UTF-16, which is a variable-length encoding; this removes the advantage of using wchar_t over byte strings.
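To show what variable-length means for UTF-16, the sketch below decodes one code point from a sequence of 16-bit units. It uses uint16_t rather than wchar_t so that it behaves the same on every platform; utf16_decode is a hypothetical helper, not a library function.

```c
#include <stdint.h>

/* Decode one code point from UTF-16 code units and store in *used how many
   units were consumed (1 or 2). Assumption: the input is terminated, so
   units[1] may be read whenever units[0] is a high surrogate. An unpaired
   surrogate is returned unchanged as a single unit. */
uint32_t utf16_decode(const uint16_t *units, int *used)
{
    uint16_t hi = units[0];
    if (hi >= 0xD800 && hi <= 0xDBFF) {
        uint16_t lo = units[1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *used = 2;
            return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                            + ((uint32_t)lo - 0xDC00u);
        }
    }
    *used = 1;
    return hi;
}
```

For example, the units 0xD83D 0xDE00 decode to the single code point U+1F600, so the number of array elements is not the number of characters.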

Declarations and Definitions

Macros

The standard header wchar.h defines the following macros.

NULL
A null pointer constant. A null pointer never points to a real object.
WCHAR_MIN
The minimum value of the type wchar_t.
WCHAR_MAX
The maximum value of the type wchar_t.
WEOF
A constant of type wint_t whose value does not correspond to any member of the extended character set. Functions return WEOF to indicate the end of a character stream, that is, end of file (EOF), or an error.[3]
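A short sketch of how these macros might be used follows. Both function names are hypothetical, and the use of long long for the range check is a choice made for this example.

```c
#include <wchar.h>

/* Nonzero if value lies within the representable range of wchar_t. */
int fits_in_wchar(long long value)
{
    return value >= (long long)WCHAR_MIN && value <= (long long)WCHAR_MAX;
}

/* Nonzero if a wint_t result signals end of input or an error
   rather than a real wide character. */
int is_end_or_error(wint_t result)
{
    return result == WEOF;
}
```

A typical caller would store the result of a wide-character input function in a wint_t and test it with is_end_or_error before converting it to wchar_t.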

Data Types

mbstate_t
An object of type mbstate_t holds the conversion state that has to be carried from one call of a multibyte/wide character conversion function to the next.
size_t
An unsigned integer type used for sizes and counts; it is the type of the result of the sizeof operator.
wchar_t
An object of type wchar_t can hold a wide character. It is also required for declaring or referencing wide characters and wide strings.
wint_t
An integer type that can hold any value corresponding to a member of the extended character set. It can hold all values of the type wchar_t as well as the value of the macro WEOF. This type is unchanged by default argument promotions.
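The types above can be seen working together in the following sketch, which converts a multibyte string to wide characters one character at a time with the standard function mbrtowc. The helper to_wide and its error convention are assumptions made for this example; the conversion follows the current locale, which is the "C" locale unless setlocale has been called.

```c
#include <string.h>
#include <wchar.h>

/* Convert the multibyte string in to wide characters in out, which has room
   for out_cap elements. Returns the number of wide characters written, not
   counting the terminator, or (size_t)-1 if the input is invalid or
   incomplete, or if out is too small. (to_wide is a hypothetical helper.) */
size_t to_wide(const char *in, wchar_t *out, size_t out_cap)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);      /* initial conversion state */

    size_t remaining = strlen(in);
    size_t written = 0;
    while (remaining > 0) {
        if (written + 1 >= out_cap)       /* keep room for the terminator */
            return (size_t)-1;
        size_t n = mbrtowc(&out[written], in, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;            /* invalid or incomplete sequence */
        if (n == 0)
            break;                        /* a null character was converted */
        in += n;
        remaining -= n;
        written++;
    }
    if (out_cap == 0)
        return (size_t)-1;
    out[written] = L'\0';
    return written;
}
```

The mbstate_t object is what lets mbrtowc resume correctly when a character spans several bytes, and size_t carries both the byte counts and the special error values.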

Functions

Wide string manipulation
  • wcscpy - copies one wide string to another
  • wcsncpy - writes exactly n characters to a wide string, copying from given string or adding nulls
  • wcscat - appends one wide string to another
  • wcsncat - appends no more than n characters from one wide string to another
  • wcsxfrm - transforms a wide string according to the current locale
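As an illustration of the copying and appending functions, the sketch below joins two wide strings into a fixed-size buffer and truncates the result if necessary. The helper wide_join is hypothetical; all sizes are counted in wide characters, not bytes.

```c
#include <stddef.h>
#include <wchar.h>

/* Write a followed by b into dst, which has room for dst_cap wide characters.
   The result is truncated if needed and is always null-terminated. */
void wide_join(wchar_t *dst, size_t dst_cap, const wchar_t *a, const wchar_t *b)
{
    if (dst_cap == 0)
        return;
    wcsncpy(dst, a, dst_cap - 1);         /* copies and pads with nulls */
    dst[dst_cap - 1] = L'\0';             /* wcsncpy may not terminate */
    wcsncat(dst, b, dst_cap - 1 - wcslen(dst));
}
```

With a 16-element buffer, joining L"wide" and L"char" yields L"widechar"; with an 8-element buffer the result is cut to L"widecha".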
Wide string examination
  • wcslen - returns the length of a wide string
  • wcscmp - compares two wide strings
  • wcsncmp - compares a specific number of characters in two wide strings
  • wcscoll - compares two wide strings according to the current locale
  • wcschr - finds the first occurrence of a character in a wide string
  • wcsrchr - finds the last occurrence of a character in a wide string
  • wcsspn - returns the length of the initial part of a wide string that consists only of characters in a given set
  • wcscspn - returns the length of the initial part of a wide string that consists only of characters not in a given set
  • wcspbrk - finds in a wide string the first occurrence of a character in a set of characters
  • wcsstr - finds in a wide string the first occurrence of a substring
  • wcstok - finds in a wide string the next occurrence of a token
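A few of the examination functions are sketched below in small wrappers. The wrapper names and the particular character sets are chosen only for illustration.

```c
#include <stddef.h>
#include <wchar.h>

/* Length of the leading run of decimal digits in s. */
size_t leading_digits(const wchar_t *s)
{
    return wcsspn(s, L"0123456789");
}

/* Index of the first comma or semicolon in s, or wcslen(s) if there is none. */
size_t first_separator(const wchar_t *s)
{
    return wcscspn(s, L",;");
}

/* Nonzero if haystack contains needle as a substring. */
int wide_contains(const wchar_t *haystack, const wchar_t *needle)
{
    return wcsstr(haystack, needle) != NULL;
}
```

For instance, leading_digits(L"123abc") returns 3 and first_separator(L"ab;cd") returns 2.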
Memory manipulation
  • wmemset - fills a buffer with a repeated wide character
  • wmemcpy - copies one buffer to another
  • wmemmove - copies one buffer to another, possibly overlapping, buffer
  • wmemcmp - compares two buffers
  • wmemchr - finds the first occurrence of a wide character in a buffer

Conversion Functions

Name Notes
wint_t btowc(int c); converts the single-byte character c to its wide character equivalent and returns the result; returns WEOF if c is EOF or is not a valid single-byte character.
int wctob(wint_t c); returns the single-byte equivalent of the wide character c; returns EOF if c has no single-byte representation.
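The two conversion functions are inverses for characters that have a single-byte form. A minimal sketch, with hypothetical helper names and assuming the default "C" locale for the example inputs:

```c
#include <stdio.h>
#include <wchar.h>

/* Nonzero if the byte c survives a round trip through btowc and wctob. */
int round_trips(unsigned char c)
{
    wint_t w = btowc(c);
    if (w == WEOF)
        return 0;                /* c is not a valid single-byte character */
    return wctob(w) == (int)c;
}

/* Nonzero if btowc reports WEOF when given EOF, as the standard requires. */
int btowc_rejects_eof(void)
{
    return btowc(EOF) == WEOF;
}
```

Note the asymmetry in the failure values: btowc returns a wint_t and signals failure with WEOF, while wctob returns an int and signals failure with EOF.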


References

  1. ^ ISO/IEC 9899:1999 specification (PDF). p. 371, § 7.24.4 "General wide string utilities". Retrieved 30 November 2011.
  2. ^ http://books.google.co.in/books?id=4Mfe4sAMFUYC&pg=PT26&lpg=PT26&dq=wide+characters+as+superset&source=bl&ots=tPLP1nN4qh&sig=f2W0ys85Ms9lRdT4HBEf_yoNL2U&hl=en&ei=H3SMTpiJFpGzrAfzvOycAg&sa=X&oi=book_result&ct=result&resnum=2&sqi=2&ved=0CCIQ6AEwAQ#v=onepage&q&f=false
  3. ^ http://publib.boulder.ibm.com/infocenter/zos/v1r12/index.jsp?topic=%2Fcom.ibm.zos.r12.bpxbd00%2Fwcharh.htm