Learning Python - Mark Lutz [486]
To achieve this, the 3.0 str type is defined as an immutable sequence of characters (not necessarily bytes), which may be either normal text such as ASCII with one byte per character, or richer character set text such as UTF-8 Unicode that may include multibyte characters. Strings processed by your script with this type are encoded per the platform default, but explicit encoding names may be provided to translate str objects to and from different schemes, both in memory and when transferring to and from files.
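As a brief illustration of translating a str to and from different encoding schemes, the following sketch encodes one four-character string under two encodings. The sample text and variable names are made up for this example; only the standard str.encode and bytes.decode methods are assumed.

```python
# A str is a sequence of characters; its byte length depends on the encoding.
rich = "sp\xe4m"                    # four characters, one of them non-ASCII

utf8 = rich.encode("utf-8")         # the non-ASCII character takes two bytes
latin1 = rich.encode("latin-1")     # every character takes one byte

print(len(rich), len(utf8), len(latin1))

# Decoding with the same encoding name recovers the original str.
roundtrip = utf8.decode("utf-8")
print(roundtrip == rich)
```

The character count stays the same in both cases; only the encoded byte representation differs.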
While 3.0’s new str type does achieve the desired string/unicode merging, many programs still need to process raw binary data that is not encoded per any text format. Image and audio files, as well as packed data used to interface with devices or C programs you might process with Python’s struct module, fall into this category. To support processing of truly binary data, therefore, a new type, bytes, also was introduced.
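To make the idea of packed binary data concrete, here is a minimal sketch using the struct module mentioned above. The format string and the values are arbitrary choices for illustration, not taken from any particular device or C program.

```python
import struct

# Pack a 4-byte integer and a 2-byte short, big-endian, with no padding.
packed = struct.pack(">ih", 7, 2)
print(type(packed), len(packed))    # the result is a bytes object, not text

# Unpacking with the same format string recovers the original values.
values = struct.unpack(">ih", packed)
print(values)
```

The packed result has no meaningful text encoding; it is simply a series of byte values, which is the role the bytes type is meant to fill.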
In 2.X, the general str type filled this binary data role, because strings were just sequences of bytes (the separate unicode type handled wide-character strings). In 3.0, the bytes type is defined as an immutable sequence of 8-bit integers representing absolute byte values. Moreover, the 3.0 bytes type supports almost all the same operations that the str type does; this includes string methods, sequence operations, and even re module pattern matching, but not string formatting.
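The overlap between str and bytes operations can be sketched as follows. The sample data is invented for illustration; note that the re pattern must itself be a bytes object when the subject is bytes.

```python
import re

raw = b"spam and eggs"

# String methods and sequence operations work on bytes and return bytes.
upper = raw.upper()
words = raw.split()
joined = raw + b"!"

# Pattern matching works too, provided the pattern is also bytes.
found = re.findall(rb"\w+", raw)

print(upper, words, joined, found)
```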
A 3.0 bytes object really is a sequence of small integers, each of which is in the range 0 through 255; indexing a bytes object returns an int, slicing one returns another bytes object, and running the list built-in on one returns a list of integers, not characters. When processed with operations that assume characters, though, the contents of bytes objects are assumed to be ASCII-encoded bytes (e.g., the isalpha method assumes each byte is an ASCII character code). Further, bytes objects are printed as character strings instead of integers for convenience.
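The behavior just described can be verified with a short sketch such as the following; the literal b"spam" is only a sample value.

```python
data = b"spam"

first = data[0]          # indexing yields an int: 115, the ASCII code of 's'
part = data[1:3]         # slicing yields another bytes object: b'pa'
codes = list(data)       # a list of integers: [115, 112, 97, 109]

# Character-oriented methods treat each byte as an ASCII code.
alpha = data.isalpha()   # True

# Display uses ASCII characters rather than integers.
shown = repr(data)       # "b'spam'"

print(first, part, codes, alpha, shown)
```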
While they were at it, Python developers also added a bytearray type in 3.0. bytearray is a variant of bytes that is mutable and so supports in-place changes. It supports the usual string operations that str and bytes do, as well as many of the same in-place change operations as lists (e.g., the append and extend methods, and assignment to indexes). Assuming your strings can be treated as raw bytes, bytearray finally adds direct in-place mutability for string data—something not possible without conversion to a mutable type in Python 2, and not supported by Python 3.0’s str or bytes.
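A minimal sketch of these in-place changes follows. The starting value is arbitrary; because bytearray items are integers, single-item assignments and append take byte values rather than characters.

```python
buf = bytearray(b"spam")

buf[0] = ord("S")        # index assignment takes an integer byte value
buf.append(ord("!"))     # append adds a single byte value
buf.extend(b"??")        # extend adds every byte from another bytes-like object

print(buf)               # the same object has been changed in place

# An immutable copy can be made when in-place changes are finished.
frozen = bytes(buf)
print(frozen)
```

The same assignments on a str or bytes object would raise a TypeError, since both of those types are immutable.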
Although Python 2.6 and 3.0 offer much the same functionality, they package it differently. In fact, the mapping from 2.6 to 3.0 string types is not direct—2.6’s str equates to both str and bytes in 3.0, and 3.0’s str equates to both str and unicode in 2.6. Moreover, the mutability of 3.0’s bytearray is unique.
In practice, though, this asymmetry is not as daunting as it might sound. It boils down to the following: in 2.6, you will use str for simple text and binary data and unicode for more advanced forms of text; in 3.0, you’ll use str for any kind of text (simple and Unicode) and bytes or bytearray for binary data. Moreover, the choice is often made for you by the tools you use, especially file processing tools, the topic of the next section.
Text and Binary Files
File I/O (input and output) has also been revamped in 3.0 to reflect the str/bytes distinction and automatically support encoding Unicode text. Python now makes a sharp platform-independent distinction between text files and binary files:
Text files
When a file is opened in text mode, reading its data automatically decodes its content (per a platform default or a provided encoding name) and returns it as a str; writing takes a str and automatically encodes it before transferring it to the file. Text-mode files also support universal end-of-line translation and additional encoding specification arguments. Depending on the encoding name, text files may also automatically