Learning Python - Mark Lutz [485]
The widely used UTF-8 encoding, for example, allows a wide range of characters to be represented by using a variable number of bytes per character. Character codes less than 128 are represented as a single byte; codes between 128 and 0x7ff (2047) are turned into two bytes, where each byte has a value between 128 and 255; and codes above 0x7ff are turned into three- or four-byte sequences in which each byte has a value between 128 and 255. This keeps simple ASCII strings compact, sidesteps byte ordering issues, and avoids null (zero) bytes that can cause problems for C libraries and networking.
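As a quick illustration of the variable-length scheme just described, the following Python 3.X sketch encodes one character from each code range and counts the resulting bytes. The sample characters and variable names are chosen here for illustration only.

```python
# One sample character per UTF-8 length class (illustrative choices):
# U+0041 (below 128), U+00E4 (128..0x7ff), U+20AC and U+1F600 (above 0x7ff)
samples = ["A", "\u00e4", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)                      # [1, 2, 3, 4]

# Every byte of a multibyte sequence has a value between 128 and 255
multibyte = "\u20ac".encode("utf-8")
high_bytes_only = all(128 <= b <= 255 for b in multibyte)
print(list(multibyte), high_bytes_only)  # [226, 130, 172] True
```

Because no byte in a multibyte sequence is below 128, such sequences never contain a zero byte and cannot be mistaken for ASCII characters.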
Because encodings’ character maps assign characters to the same codes for compatibility, ASCII is a subset of both Latin-1 and UTF-8; that is, a valid ASCII character string is also a valid Latin-1- and UTF-8-encoded string. This is also true when the data is stored in files: every ASCII file is a valid UTF-8 file, because ASCII is a 7-bit subset of UTF-8.
Conversely, the UTF-8 encoding is binary compatible with ASCII for all character codes less than 128. Latin-1 and UTF-8 simply allow for additional characters: Latin-1 for characters mapped to values 128 through 255 within a byte, and UTF-8 for characters that may be represented with multiple bytes. Other encodings allow wider character sets in similar ways, but all of these—ASCII, Latin-1, UTF-8, and many others—are considered to be Unicode.
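The subset relationship can be checked interactively. This minimal Python 3.X sketch (the strings and variable names are illustrative) encodes a pure-ASCII string under all three encodings, then shows where they diverge for a character above 127.

```python
text = "spam"                                # pure ASCII text
ascii_bytes = text.encode("ascii")
latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == latin1_bytes == utf8_bytes)   # True: identical bytes

accented = "\u00e4"                          # a-umlaut, code point 228
latin1_form = accented.encode("latin-1")     # one byte in Latin-1
utf8_form = accented.encode("utf-8")         # two bytes in UTF-8
print(latin1_form, utf8_form)                # b'\xe4' b'\xc3\xa4'

# ASCII has no code for this character, so encoding fails
try:
    accented.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                              # False
```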
To Python programmers, encodings are specified as strings containing the encoding’s name. Python comes with roughly 100 different encodings; see the Python library reference for a complete list. Importing the module encodings and running help(encodings) shows you many encoding names as well; some are implemented in Python, and some in C. Some encodings have multiple names, too; for example, latin-1, iso_8859_1, and 8859 are all synonyms for the same encoding, Latin-1. We’ll revisit encodings later in this chapter, when we study techniques for writing Unicode strings in a script.
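One way to confirm that several names refer to the same encoding is to look each one up with the standard library's codecs module and compare the canonical names it reports. This is a small sketch in Python 3.X; the list of aliases is simply the one mentioned above, and the exact canonical name printed may vary by Python version.

```python
import codecs

# The three synonyms for Latin-1 mentioned in the text
aliases = ["latin-1", "iso_8859_1", "8859"]

# codecs.lookup returns a CodecInfo object whose .name is the canonical name
canonical = [codecs.lookup(name).name for name in aliases]
print(canonical)

# All three aliases should resolve to a single codec
same_codec = len(set(canonical)) == 1
print(same_codec)
```

An unknown name passed to codecs.lookup raises LookupError, which makes this a convenient check before using a name supplied by a user or a configuration file.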
For more on the Unicode story, see the Python standard manual set. It includes a “Unicode HOWTO” in its “Python HOWTOs” section, which provides additional background that we will skip here in the interest of space.
Python’s String Types
At a more concrete level, the Python language provides string data types to represent character text in your scripts. The string types you will use in your scripts depend upon the version of Python you’re using. Python 2.X has a general string type for representing binary data and simple 8-bit text like ASCII, along with a specific type for representing multibyte Unicode text:
str for representing 8-bit text and binary data
unicode for representing wide-character Unicode text
Python 2.X’s two string types are different (unicode allows for the extra size of characters and has extra support for encoding and decoding), but their operation sets largely overlap. The str string type in 2.X is used for text that can be represented with 8-bit bytes, as well as binary data that represents absolute byte values.
By contrast, Python 3.X comes with three string object types—one for textual data and two for binary data:
str for representing Unicode text (both 8-bit and wider)
bytes for representing binary data
bytearray, a mutable flavor of the bytes type
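The three 3.X types listed above can be seen side by side in a short session. The following is an illustrative sketch only, with made-up variable names, run under Python 3.X:

```python
text = "sp\u00e4m"                 # str: Unicode text, 4 characters
data = text.encode("utf-8")        # bytes: immutable sequence of byte values
buf = bytearray(data)              # bytearray: mutable copy of the same bytes

print(type(text).__name__, type(data).__name__, type(buf).__name__)
print(len(text), len(data))        # 4 characters, but 5 bytes in UTF-8

# bytearray supports in-place change
buf[0] = ord("S")
changed = buf.decode("utf-8")
print(changed)

# bytes does not: item assignment raises TypeError
try:
    data[0] = ord("S")
    bytes_mutable = True
except TypeError:
    bytes_mutable = False
print(bytes_mutable)
```

The length difference between text and data reflects the split in roles: str counts characters, while bytes and bytearray count the raw byte values produced by an encoding.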
As mentioned earlier, bytearray is also available in Python 2.6, but it’s simply a back-port from 3.0 with less content-specific behavior and is generally considered a 3.0 type.
All three string types in 3.0 support similar operation sets, but they have different roles. The main goal behind this change in 3.X was to merge the normal and Unicode string types of 2.X into a single string type that supports both normal and Unicode text: developers wanted to remove the 2.X string dichotomy and make Unicode processing more natural. Given that ASCII and other 8-bit text is really a simple kind of Unicode,