Code: The Hidden Language of Computer Hardware and Software - Charles Petzold [119]
Because many computer systems store characters as 8-bit values, it's possible to devise an extended ASCII character set that contains 256 characters rather than just 128. In such a character set, codes 00h through 7Fh are defined just as they are in ASCII; codes 80h through FFh can be something else entirely. This technique has been used to define additional character codes to accommodate accented letters and non-Latin alphabets. As an example, here's a 96-character extension of ASCII called the Latin Alphabet No. 1 that defines characters for codes A0h through FFh. In this table, the high-order nibble of the hexadecimal character code is shown in the top row; the low-order nibble is shown in the left column.
      A-   B-   C-   D-   E-   F-
-0         °    À    Ð    à    ð
-1    ¡    ±    Á    Ñ    á    ñ
-2    ¢    ²    Â    Ò    â    ò
-3    £    ³    Ã    Ó    ã    ó
-4    ¤    ´    Ä    Ô    ä    ô
-5    ¥    µ    Å    Õ    å    õ
-6    ¦    ¶    Æ    Ö    æ    ö
-7    §    ·    Ç    ×    ç    ÷
-8    ¨    ¸    È    Ø    è    ø
-9    ©    ¹    É    Ù    é    ù
-A    ª    º    Ê    Ú    ê    ú
-B    «    »    Ë    Û    ë    û
-C    ¬    ¼    Ì    Ü    ì    ü
-D    -    ½    Í    Ý    í    ý
-E    ®    ¾    Î    Þ    î    þ
-F    ¯    ¿    Ï    ß    ï    ÿ
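To read the table, combine the high-order nibble from the column with the low-order nibble from the row. The following is a minimal Python sketch of that lookup. It assumes that Python's built-in "latin-1" codec corresponds to the Latin Alphabet No. 1; the helper name latin1_char is made up for illustration.

```python
def latin1_char(high_nibble: int, low_nibble: int) -> str:
    """Return the character whose 8-bit code is built from two nibbles."""
    # Shift the high nibble into the upper 4 bits and merge in the low nibble.
    code = (high_nibble << 4) | low_nibble
    # Decode the single byte using the (assumed) Latin Alphabet No. 1 codec.
    return bytes([code]).decode("latin-1")


# Column E-, row -9 of the table.
print(latin1_char(0xE, 0x9))
# Codes below 80h are plain ASCII: 41h is the capital letter A.
print(latin1_char(0x4, 0x1))
```

The same function works for the ASCII half of the character set because codes 00h through 7Fh keep their ASCII meanings.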
The character for code A0h is defined as a no-break space. Usually when a computer program formats text into lines and paragraphs, it breaks each line at a space character, which is ASCII code 20h. Code A0h is supposed to be displayed as a space but can't be used for breaking a line. A no-break space might be used in the text "WW II," for example. Code ADh is defined as a soft hyphen. This is a hyphen used to separate syllables in the middle of words. It appears on the printed page only when it's necessary to break a word between two lines.
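The effect of a no-break space on line breaking can be sketched with a toy formatter. This is only an illustration, not how any particular word processor works: the function wrap is hypothetical, it breaks lines greedily, and it treats only the ordinary space (code 20h) as a break opportunity.

```python
NBSP = "\u00a0"  # code A0h, the no-break space


def wrap(text: str, width: int) -> list[str]:
    """Greedy line wrapping that breaks only at ordinary spaces (20h)."""
    lines = []
    current = ""
    # Splitting on " " leaves any no-break space inside its word.
    for word in text.split(" "):
        if not current:
            current = word
        elif len(current) + 1 + len(word) <= width:
            current += " " + word
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


# With an ordinary space, "WW" and "II" can land on different lines.
print(wrap("the end of WW II", 13))
# With a no-break space, "WW II" moves to the next line as one unit.
print(wrap("the end of WW" + NBSP + "II", 13))
```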
Unfortunately, many different extensions of ASCII have been defined over the decades, leading to much confusion and incompatibility.
ASCII has been extended in a more radical way to encode the ideographs of Chinese, Japanese, and Korean. In one popular encoding, called Shift-JIS (Japanese Industrial Standard), codes 81h through 9Fh actually represent the initial byte of a 2-byte character code. In this way, Shift-JIS allows for the encoding of about 6000 additional characters. Unfortunately, Shift-JIS isn't the only system that uses this technique. Three other standard double-byte character sets (DBCS) are popular in Asia.
The existence of several incompatible double-byte character sets is only one of the problems with them. The other problem is that some characters (specifically, the normal ASCII characters) are represented by 1-byte codes, while the thousands of ideographs are represented by 2-byte codes. This mix of widths makes such character sets difficult to work with: a program can't tell where a given character begins, or how many characters a string contains, without examining the bytes one by one from the start.
Under the assumption that it's preferable to have just one unambiguous character encoding system that's suitable for all the world's languages, in 1988 several major computer companies got together and began developing an alternative to ASCII known as Unicode. Whereas ASCII is a 7-bit code, Unicode is a 16-bit code. Each and every character in Unicode requires 2 bytes. That means that Unicode has character codes ranging from 0000h through FFFFh and can represent 65,536 different characters. That's enough for all the world's languages that are likely to be used in computer communication, with room for expansion.
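The arithmetic behind a 16-bit code can be sketched in Python as follows. This is an illustration of the fixed-width scheme described above, not a full Unicode implementation: the function name encode_16bit is hypothetical, and storing the high-order byte first is an assumption made for the example.

```python
import struct

# A 16-bit code has 2**16 possible values, 0000h through FFFFh.
print(2 ** 16)


def encode_16bit(text: str) -> bytes:
    """Store each character as one 2-byte code, high-order byte first."""
    # ">H" packs an unsigned 16-bit integer in big-endian order (assumed here).
    # struct.pack raises struct.error for any code above FFFFh.
    return b"".join(struct.pack(">H", ord(ch)) for ch in text)


encoded = encode_16bit("Hi")
print(encoded.hex())  # two characters, four bytes
```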
Unicode doesn't start from scratch. The first 128 characters of Unicode—codes 0000h through 007Fh—are the same as the ASCII characters. Also, Unicode codes 00A0h through 00FFh are the same as the Latin Alphabet No. 1 extension of ASCII that I described earlier. Other worldwide standards are also incorporated into Unicode.
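This compatibility can be checked directly in Python, whose chr function maps a Unicode code to its character. The sketch assumes that Python's "ascii" and "latin-1" codecs stand in for ASCII and the Latin Alphabet No. 1; the helper name unicode_matches is made up for illustration.

```python
def unicode_matches(codec: str, start: int, stop: int) -> bool:
    """True if every 8-bit code in [start, stop) decodes to the Unicode
    character with the same numeric code."""
    return all(
        bytes([code]).decode(codec) == chr(code)
        for code in range(start, stop)
    )


# Codes 00h-7Fh of ASCII versus Unicode 0000h-007Fh.
print(unicode_matches("ascii", 0x00, 0x80))
# Codes A0h-FFh of Latin Alphabet No. 1 versus Unicode 00A0h-00FFh.
print(unicode_matches("latin-1", 0xA0, 0x100))
```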
While Unicode may be an obvious improvement over existing character codes, that doesn't guarantee it instant acceptability. ASCII and the myriad flawed extensions of ASCII have become so entrenched in the computing world that it will be difficult to dislodge them.
The only real problem with Unicode is that it makes invalid the old equivalence between one character of text and 1 byte of storage.