Professional C++ - Marc Gregoire [251]
Wide Characters
The problem with viewing a character as a byte is that not all languages, or character sets, can be fully represented in 8 bits, or 1 byte. C++ has a built-in type called wchar_t that holds a wide character. Languages with non-ASCII (U.S.) characters, such as Japanese and Arabic, can be represented in C++ with wchar_t. However, the C++ standard does not define a size for wchar_t. Some compilers use 16 bits while others use 32 bits. To write portable software, it is not safe to assume that sizeof(wchar_t) is any particular numerical value.
If there is any chance that your program will be used in a non-Western character set context (hint: there is!), you should use wide characters from the beginning. When working with wchar_t, string and character literals are prefixed with the letter L to indicate that a wide-character encoding should be used. For example, to initialize a wchar_t character to be the letter m, you would write it like this:
wchar_t myWideCharacter = L'm';
There are wide-character versions of most of your favorite types and classes. The wide string class is wstring. The “prefix letter w” pattern applies to streams as well. Wide-character file output streams are handled with the wofstream, and input is handled with the wifstream. The joy of pronouncing these class names (woof-stream? whiff-stream?) is reason enough to make your programs locale aware! Streams are discussed in detail in Chapter 15.
In addition to cout, cin, cerr, and clog, there are wide versions of the built-in console and error streams called wcout, wcin, wcerr, and wclog. Using them is no different from using the non-wide versions:
wcout << L"I am wide-character aware." << endl;
Code snippet from WideStrings\wcout.cpp
Non-Western Character Sets
Wide characters are a great step forward because they increase the amount of space available to define a single character. The next step is to figure out how that space is used. In wide character sets, just like in ASCII, a number corresponds to a particular glyph. The only difference is that each number does not fit in 8 bits. The map of characters to numbers (now called code points) is quite a bit larger because it handles many different character sets in addition to the characters that English-speaking programmers are familiar with.
The Universal Character Set (UCS), defined by the International Standard ISO 10646, and Unicode are both standardized sets of characters. They contain around one hundred thousand abstract characters, each identified by an unambiguous name and an integer number called its code point. The same characters with the same numbers exist in both standards. Both have specific encodings that you can use. For example, UTF-8 is a Unicode encoding in which each Unicode character is encoded using one to four 8-bit bytes. UTF-16 encodes each Unicode character as one or two 16-bit values, and UTF-32 encodes each Unicode character as exactly one 32-bit value.
Different applications can use different encodings. Unfortunately, the C++ standard does not specify a size for wide characters (wchar_t). On Windows it is 16 bits, while on other platforms it could be 32 bits. You need to be aware of this when using wide characters for character encoding in cross-platform code. To help solve this issue, C++11 introduces two new character types: char16_t and char32_t. The following list gives an overview of all character types supported by C++11:
char: Stores 8 bits. Can be used to store ASCII characters, or as a basic building block for storing UTF-8 encoded Unicode characters, where one Unicode character is encoded as one to four chars.
char16_t: Stores 16 bits. Can be used as the basic building block for UTF-16 encoded Unicode characters where one Unicode character is encoded as one or two char16_ts.
char32_t: Stores 32 bits. Can be used for storing