Learning Python - Mark Lutz [504]
Regrettably, going into further XML parsing details is beyond this book’s scope. If you are interested in text or XML parsing, it is covered in more detail in the applications-focused follow-up book Programming Python. For more details on re, struct, pickle, and XML tools in general, consult the Web, the aforementioned book and others, and Python’s standard library manual.
Chapter Summary
This chapter explored advanced string types available in Python 3.0 and 2.6 for processing Unicode text and binary data. As we saw, many programmers use ASCII text and can get by with the basic string type and its operations. For more advanced applications, Python’s string models fully support both wide-character Unicode text (via the normal string type in 3.0 and a special type in 2.6) and byte-oriented data (represented with a bytes type in 3.0 and normal strings in 2.6).
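As a quick refresher on the 3.0 side of this model, the following is a minimal sketch rather than a definitive recipe. It assumes a Python 3.X interpreter, and the sample text is made up purely for illustration:

```python
# Python 3.X: str holds Unicode text; bytes holds raw 8-bit values
text = 'sp\xe4m'                     # 4 characters, one of them non-ASCII
data = text.encode('latin-1')        # str -> bytes, one byte per character here
utf8 = text.encode('utf-8')          # same text, different byte sequence

print(type(text).__name__, len(text))          # str 4
print(type(data).__name__, len(data), data)    # bytes 4 b'sp\xe4m'
print(len(utf8), utf8)                         # 5 b'sp\xc3\xa4m'
print(data.decode('latin-1') == text)          # True: bytes -> str round trip
```

The same four characters occupy four bytes under Latin-1 but five under UTF-8, which is why text and its encoded bytes have to be kept distinct.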
In addition, we learned how Python’s file object has mutated in 3.0 to automatically encode and decode Unicode text and deal with byte strings for binary-mode files. Finally, we briefly met some text and binary data tools in Python’s library, and sampled their behavior in 3.0.
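To make the file behavior concrete, here is a small sketch under stated assumptions: a Python 3.X interpreter, a scratch file in a temporary directory (the name unidata.txt is hypothetical), and UTF-8 chosen arbitrarily as the encoding:

```python
import os
import tempfile

text = 'sp\xe4m\n'
path = os.path.join(tempfile.mkdtemp(), 'unidata.txt')   # hypothetical scratch file

# Text mode: str is encoded on write and decoded on read.
# newline='' turns off line-end translation so results match on all platforms.
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write(text)
with open(path, 'r', encoding='utf-8', newline='') as f:
    decoded = f.read()

# Binary mode: the same file yields raw, undecoded bytes
with open(path, 'rb') as f:
    raw = f.read()

print(decoded == text)    # True
print(raw)                # b'sp\xc3\xa4m\n'
```

Passing an explicit encoding name to open is also how you would read or create a file in an encoding other than your platform's default.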
In the next chapter, we’ll shift our focus to tool-builder topics, with a look at ways to manage access to object attributes by inserting automatically run code. Before we move on, though, here’s a set of questions to review what we’ve learned here.
Test Your Knowledge: Quiz
What are the names and roles of string object types in Python 3.0?
What are the names and roles of string object types in Python 2.6?
What is the mapping between 2.6 and 3.0 string types?
How do Python 3.0’s string types differ in terms of operations?
How can you code non-ASCII Unicode characters in a string in 3.0?
What are the main differences between text- and binary-mode files in Python 3.0?
How would you read a Unicode text file that contains text in a different encoding than the default for your platform?
How can you create a Unicode text file in a specific encoding format?
Why is ASCII text considered to be a kind of Unicode text?
How large an impact does the change in Python 3.0’s string types have on your code?
Test Your Knowledge: Answers
Python 3.0 has three string types: str (for Unicode text, including ASCII), bytes (for binary data with absolute byte values), and bytearray (a mutable flavor of bytes). The str type usually represents content stored in a text file, and the other two types generally represent content stored in binary files.
Python 2.6 has two main string types: str (for 8-bit text and binary data) and unicode (for wide-character text). The str type is used for both text and binary file content; unicode is used for text file content that is generally more complex than 8 bits. Python 2.6 (but not earlier) also has 3.0’s bytearray type, but it’s mostly a back-port and doesn’t exhibit the sharp text/binary distinction that it does in 3.0.
The mapping from 2.6 to 3.0 string types is not direct, because 2.6’s str equates to both str and bytes in 3.0, and 3.0’s str equates to both str and unicode in 2.6. The mutability of bytearray in 3.0 is also unique.
Python 3.0’s string types share almost all the same operations: method calls, sequence operations, and even larger tools like pattern matching work the same way. On the other hand, in 3.0 only str supports string formatting operations, and bytearray has an additional set of operations that perform in-place changes. To translate between text and bytes, str has an encode method and bytes has a decode method.
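The following sketch illustrates the shared and type-specific operations just described. It assumes a Python 3.X interpreter, and the values are arbitrary examples:

```python
s = 'spam'                      # str: Unicode text
b = b'spam'                     # bytes: immutable sequence of small integers
ba = bytearray(b'spam')         # bytearray: mutable variant of bytes

# Sequence operations and most methods look the same on all three
print(s.upper(), b.upper(), ba.upper())
print(s.find('pa'), b.find(b'pa'), ba.find(b'pa'))     # 1 1 1

# ...but indexing bytes and bytearray yields integers, not 1-character strings
print(s[0], b[0], ba[0])                                # s 115 115

# Encoding and decoding translate between str and bytes
print(s.encode('ascii') == b, b.decode('ascii') == s)   # True True

# Only bytearray supports in-place changes
ba[0] = ord('S')
ba.extend(b'!!')
print(ba)                                               # bytearray(b'Spam!!')

# str and bytes objects do not mix in expressions
try:
    s + b
except TypeError as exc:
    mixed_error = exc           # saved because exc is cleared after the block
    print('TypeError:', exc)
```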
Non-ASCII Unicode characters can be coded in a string with both hex (\xNN) and Unicode (\uNNNN, \UNNNNNNNN) escapes. On some keyboards, some non-ASCII characters—certain Latin-1 characters, for example—can also be typed directly.
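For instance, the three escape forms can all spell the same characters. This is a minimal sketch that assumes a Python 3.X interpreter; the two accented characters used are arbitrary Latin-1 examples:

```python
# Three ways to code the same two non-ASCII characters (code points C4 and E8)
s1 = 'A\xc4B\xe8C'                      # hex escapes: 2 hex digits
s2 = 'A\u00c4B\u00e8C'                  # Unicode escapes: 4 hex digits
s3 = 'A\U000000c4B\U000000e8C'          # Unicode escapes: 8 hex digits

print(s1 == s2 == s3)                   # True
print(len(s1))                          # 5: each escape is a single character
print(ascii(s1))                        # ASCII-only display of the string
print(s1.encode('latin-1'))             # b'A\xc4B\xe8C'
print(s1.encode('utf-8'))               # b'A\xc3\x84B\xc3\xa8C'
```

The escape form chosen affects only how the literal is typed; the resulting str objects are identical.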
In 3.0, text-mode files assume their file content is Unicode text (even if it’s ASCII) and automatically decode when reading and encode when writing. With binary-mode