Learning Python - Mark Lutz [487]

process the byte order mark sequence at the start of a file (more on this momentarily).

Binary files

When a file is opened in binary mode by adding a b (lowercase only) to the mode string argument in the built-in open call, reading its data does not decode it in any way but simply returns its content raw and unchanged, as a bytes object; writing similarly takes a bytes object and transfers it to the file unchanged. Binary-mode files also accept a bytearray object for the content to be written to the file.
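As a minimal sketch of this behavior, assuming a Python 3.X interpreter and a scratch file whose name is made up for illustration, the following writes both a bytes and a bytearray object to a binary-mode file and reads the content back as raw bytes:

```python
import os
import tempfile

# Hypothetical scratch file; the name is for illustration only
path = os.path.join(tempfile.mkdtemp(), "data.bin")

with open(path, "wb") as f:
    f.write(b"\x00\x01spam\xff")      # bytes are transferred unchanged
    f.write(bytearray(b"\x02\x03"))   # a bytearray is accepted as well

with open(path, "rb") as f:
    raw = f.read()                    # no decoding: the result is bytes

print(type(raw), raw)
```

Because no decoding or translation is applied, the bytes read back are exactly the bytes written, including values such as 0xff that are not valid ASCII.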

Because the language sharply differentiates between str and bytes, you must decide whether your data is text or binary in nature and use either str or bytes objects to represent its content in your script, as appropriate. Ultimately, the mode in which you open a file will dictate which type of object your script will use to represent its content:

If you are processing image files, packed data created by other programs whose content you must extract, or some device data streams, chances are good that you will want to deal with it using bytes and binary-mode files. You might also opt for bytearray if you wish to update the data without making copies of it in memory.

If instead you are processing something that is textual in nature, such as program output, HTML, internationalized text, or CSV or XML files, you’ll probably want to use str and text-mode files.

Notice that the mode string argument to built-in function open (its second argument) becomes fairly crucial in Python 3.0—its content not only specifies a file processing mode, but also implies a Python object type. By adding a b to the mode string, you specify binary mode and will receive, or must provide, a bytes object to represent the file’s content when reading or writing. Without the b, your file is processed in text mode, and you’ll use str objects to represent its content in your script. For example, the modes rb, wb, and rb+ imply bytes; r, w+, and rt (the default) imply str.
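The link between mode string and object type can be sketched as follows. This is an illustration only, assuming Python 3.X; the file name is hypothetical, and the text contains no newline so that platform-specific line-end translation does not affect the comparison:

```python
import os
import tempfile

# Hypothetical scratch file; the name is for illustration only
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w", encoding="utf-8") as f:   # text mode: write str
    f.write("spam")

with open(path, "r", encoding="utf-8") as f:   # text mode: read str
    text = f.read()

with open(path, "rb") as f:                    # binary mode: read bytes
    data = f.read()

# Passing a str to a binary-mode file is an error in 3.X
error = None
try:
    with open(path, "wb") as f:
        f.write("spam")
except TypeError as exc:
    error = exc

print(type(text), type(data), type(error))
```

The same file content comes back as a str or a bytes object depending only on whether the mode string contains a b.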

Text-mode files also handle the byte order marker (BOM) sequence that may appear at the start of files under certain encoding schemes. In the UTF-16 and UTF-32 encodings, for example, the BOM specifies big- or little-endian format (essentially, which end of a bitstring is most significant). A UTF-8 text file may also include a BOM to declare that it is UTF-8 in general, but this isn’t guaranteed. When reading and writing data using these encoding schemes, Python automatically skips or writes the BOM if it is implied by a general encoding name or if you provide a more specific encoding name to force the issue. For example, the BOM is always processed for “utf-16,” the more specific encoding name “utf-16-le” specifies little-endian UTF-16 format, and the more specific encoding name “utf-8-sig” forces Python to both skip and write a BOM on input and output, respectively, for UTF-8 text (the general name “utf-8” does not).
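To see which encoding names write a BOM, one option is to write the same text under each name and inspect the raw bytes in binary mode. The sketch below assumes Python 3.X; the raw_bytes helper and the file names are hypothetical, introduced only for this illustration:

```python
import codecs
import os
import tempfile

folder = tempfile.mkdtemp()

def raw_bytes(encoding):
    """Write 'spam' in text mode with the given encoding; return the file's raw bytes."""
    path = os.path.join(folder, encoding + ".txt")
    with open(path, "w", encoding=encoding) as f:
        f.write("spam")
    with open(path, "rb") as f:
        return f.read()

utf8 = raw_bytes("utf-8")          # no BOM
utf8sig = raw_bytes("utf-8-sig")   # BOM written first
utf16 = raw_bytes("utf-16")        # BOM written; byte order depends on the platform
utf16le = raw_bytes("utf-16-le")   # byte order fixed by the name, no BOM

# On input, "utf-8-sig" skips the BOM, while plain "utf-8" keeps it as a character
sig_path = os.path.join(folder, "utf-8-sig.txt")
with open(sig_path, encoding="utf-8-sig") as f:
    skipped = f.read()
with open(sig_path, encoding="utf-8") as f:
    kept = f.read()

print(utf8, utf8sig, utf16, utf16le, repr(skipped), repr(kept))
```

Reading the “utf-8-sig” file with the general name “utf-8” leaves the BOM in the result as a leading '\ufeff' character, which is why the more specific name is needed to discard it.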

We’ll learn more about BOMs and files in general in the section Handling the BOM in 3.0. First, let’s explore the implications of Python’s new Unicode string model.

Python 3.0 Strings in Action

Let’s step through a few examples that demonstrate how the 3.0 string types are used. One note up front: the code in this section was run with and applies to 3.0 only. Still, basic string operations are generally portable across Python versions. Simple ASCII strings represented with the str type work the same in 2.6 and 3.0 (and exactly as we saw in Chapter 7 of this book). Moreover, although there is no bytes type in Python 2.6 (it has just the general str), it can usually run code that thinks there is—in 2.6, the call bytes(X) is present as a synonym for str(X), and the new literal form b'...' is taken to be the same as the normal string literal '...'. You may still run into version skew in some isolated cases, though; the 2.6 bytes call, for instance, does not allow the second argument (encoding name) that 3.0’s bytes requires when converting a str.
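The version difference in the bytes call can be illustrated on the 3.X side as follows. This is a sketch that assumes a Python 3.X interpreter; under 2.6, as described above, bytes(X) is simply str(X), so the two-argument call is not accepted there:

```python
text = "spam"

encoded = bytes(text, "ascii")    # 3.X: an encoding name is required for a str
same = text.encode("ascii")       # equivalent method-based spelling
literal = b"spam"                 # bytes literal form

# Omitting the encoding when converting a str raises TypeError in 3.X
error = None
try:
    bytes(text)
except TypeError as exc:
    error = exc

decoded = encoded.decode("ascii")  # back to str

print(encoded, same == literal, type(error), decoded)
```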

Literals and Basic Properties

Python 3.0 string objects originate when you call a built-in function such as str or bytes,
