Even if you fall into the last of the three categories just mentioned, though, a basic understanding of 3.0’s string model can help both to demystify some of the underlying behavior now, and to make mastering Unicode or binary data issues easier if they impact you in the future.
Python 3.0’s support for Unicode and binary data is also available in 2.6, albeit in different forms. Although our main focus in this chapter is on string types in 3.0, we’ll explore some 2.6 differences along the way too. Regardless of which version you use, the tools we’ll explore here can become important in many types of programs.
String Basics
Before we look at any code, let’s begin with a general overview of Python’s string model. To understand why 3.0 changed the way it did on this front, we have to start with a brief look at how characters are actually represented in computers.
Character Encoding Schemes
Most programmers think of strings as series of characters used to represent textual data. The way characters are stored in a computer’s memory can vary, though, depending on what sort of character set must be recorded.
The ASCII standard was created in the U.S., and it defines many U.S. programmers' notion of text strings. ASCII defines character codes from 0 through 127 and allows each character to be stored in one 8-bit byte, only 7 bits of which are actually used. For example, the ASCII standard maps the character 'a' to the integer value 97 (0x61 in hex), which is stored in a single byte in memory and files. If you wish to see how this works, Python's ord built-in function gives the integer code value of a character, and chr returns the character for a given integer code:
>>> ord('a') # 'a' is a byte with binary value 97 in ASCII
97
>>> hex(97)
'0x61'
>>> chr(97) # Binary value 97 stands for character 'a'
'a'
Sometimes one byte per character isn't enough, though. Various symbols and accented characters, for instance, do not fit into the range of characters defined by ASCII. To accommodate them, some standards use all possible values in an 8-bit byte, 0 through 255, and assign the values 128 through 255 (outside ASCII's range) to extra characters. One such standard, known as Latin-1, is widely used in Western Europe. In Latin-1, character codes above 127 are assigned to accented and otherwise special characters. The character assigned to byte value 196, for example, is a specially marked non-ASCII character:
>>> 0xC4
196
>>> chr(196)
'Ä'
This standard allows for a wide array of extra special characters. Still, some alphabets define so many characters that it is impossible to represent each of them as one byte. Unicode allows more flexibility. Unicode strings are commonly referred to as "wide-character" strings, because each character may be represented with multiple bytes. Unicode is typically used in internationalized programs, to represent European and Asian character sets that have more characters than 8-bit bytes can represent.
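To see the multibyte effect for yourself, you can ask Python 3.X to encode a string explicitly. In the following sketch, the character and the encoding names are just illustrative choices; the str.encode method used here returns the string's raw bytes under a given encoding:
>>> S = 'Ä'                  # a one-character Unicode string
>>> len(S)                   # one character...
1
>>> S.encode('latin-1')      # ...fits in one byte in Latin-1
b'\xc4'
>>> S.encode('utf-8')        # ...but requires two bytes in UTF-8
b'\xc3\x84'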
To store such rich text in computer memory, we say that characters are translated to and from raw bytes using an encoding—the rules for translating a string of Unicode characters into a sequence of bytes, and extracting a string from a sequence of bytes. More procedurally, this translation back and forth between bytes and strings is defined by two terms:
Encoding is the process of translating a string of characters into its raw bytes form, according to a desired encoding name.
Decoding is the process of translating a raw string of bytes into its character string form, according to its encoding name.
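For instance, here is a minimal round-trip sketch using Python 3.X's str.encode and bytes.decode methods (the string and the ASCII encoding name are arbitrary examples):
>>> S = 'spam'               # a str of Unicode characters
>>> B = S.encode('ascii')    # encode: characters -> raw bytes
>>> B
b'spam'
>>> B.decode('ascii')        # decode: raw bytes -> characters
'spam'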
That is, we encode from string to raw bytes, and decode from raw bytes to string. For some encodings, the translation process is trivial—ASCII