Learning Python - Mark Lutz [497]
# bytearrays work too
>>> BA = bytearray(b'\x01\x02\x03')
>>> open('temp', 'wb').write(BA)
3
>>> open('temp', 'r').read()
'\x01\x02\x03'
>>> open('temp', 'rb').read()
b'\x01\x02\x03'
Type and Content Mismatches
Notice that you cannot get away with violating Python’s str/bytes type distinction when it comes to files. As the following examples illustrate, we get errors (shortened here) if we try to write a bytes to a text file or a str to a binary file:
# Types are not flexible for file content
>>> open('temp', 'w').write('abc\n') # Text mode makes and requires str
4
>>> open('temp', 'w').write(b'abc\n')
TypeError: can't write bytes to text stream
>>> open('temp', 'wb').write(b'abc\n') # Binary mode makes and requires bytes
4
>>> open('temp', 'wb').write('abc\n')
TypeError: can't write str to binary stream
This makes sense: text has no meaning in binary terms until it is encoded. Although it is often possible to convert between the types by encoding str and decoding bytes, as described earlier in this chapter, you will usually want to stick to either str for text data or bytes for binary data. Because the str and bytes operation sets largely intersect, the choice won’t be much of a dilemma for most programs (see the string tools coverage in the final section of this chapter for some prime examples of this).
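As a minimal sketch of the conversion just mentioned, a str can be encoded before it is written in binary mode, and a bytes can be decoded before it is written in text mode. The scratch path and the variable names here are illustrative assumptions, not part of the session above:

```python
import os
import tempfile

# Hypothetical scratch file; any writable path would do
path = os.path.join(tempfile.mkdtemp(), 'temp')

text = 'abc\n'
data = text.encode('ascii')              # str -> bytes, with an explicit encoding

with open(path, 'wb') as f:              # Binary mode requires bytes
    wrote_bytes = f.write(data)

# Text mode requires str, so decode the bytes first.
# newline='' turns off line-end translation so the file content is the same everywhere.
with open(path, 'w', encoding='ascii', newline='') as f:
    wrote_chars = f.write(data.decode('ascii'))

with open(path, 'rb') as f:              # Inspect what actually reached the file
    raw = f.read()

print(wrote_bytes, wrote_chars, raw)
```

The explicit conversion keeps the type matched to the file mode, which is the constraint the errors above enforce.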
In addition to type constraints, file content can matter in 3.0. Text-mode output files require a str instead of a bytes for content, so there is no way in 3.0 to write truly binary data to a text-mode file. Depending on the encoding rules, bytes outside the default character set can sometimes be embedded in a normal string, and they can always be written in binary mode. However, because text-mode input files in 3.0 must be able to decode content per a Unicode encoding, there is no way to read truly binary data in text mode:
# Can't read truly binary data in text mode
>>> chr(0xFF) # Both are valid characters, but this console can print only FF
'ÿ'
>>> chr(0xFE)
UnicodeEncodeError: 'charmap' codec can't encode character '\xfe' in position 1...
>>> open('temp', 'w').write(b'\xFF\xFE\xFD') # Can't use arbitrary bytes!
TypeError: can't write bytes to text stream
>>> open('temp', 'w').write('\xFF\xFE\xFD') # Can write if embeddable in str
3
>>> open('temp', 'wb').write(b'\xFF\xFE\xFD') # Can also write in binary mode
3
>>> open('temp', 'rb').read() # Can always read as binary bytes
b'\xff\xfe\xfd'
>>> open('temp', 'r').read() # Can't read text unless decodable!
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-3: ...
This last error stems from the fact that all text files in 3.0 are really Unicode text files, as the next section describes.
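How a text-mode read of such bytes fails depends on the encoding in effect, which varies by platform. As a minimal sketch, assuming UTF-8 is requested explicitly through open's encoding argument (the scratch path and variable names are illustrative assumptions, not part of the session above), the same three bytes are rejected at decode time:

```python
import os
import tempfile

# Hypothetical scratch file; any writable path would do
path = os.path.join(tempfile.mkdtemp(), 'temp')

with open(path, 'wb') as f:              # Binary mode accepts arbitrary bytes
    f.write(b'\xFF\xFE\xFD')

with open(path, 'rb') as f:              # Binary reads never decode, so they always work
    raw = f.read()

try:
    with open(path, 'r', encoding='utf-8') as f:
        f.read()                         # Text reads must decode the content
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False                   # 0xFF cannot begin a UTF-8 sequence

print(raw, decoded_ok)
```

Under a different encoding, the same bytes might decode without complaint into unintended characters, which is another reason to read binary data in binary mode.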
Using Unicode Files
So far, we’ve been reading and writing basic text and binary files, but what about processing Unicode files? It turns out to be easy to read and write Unicode text stored in files, because the 3.0 open call accepts an encoding for text files, which does the encoding and decoding for us automatically as data is transferred. This allows us to process Unicode text created with encodings other than the platform default, and to convert text by storing it in a different encoding.
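As a rough sketch of that conversion idea, the following writes a short non-ASCII string under one encoding, reads it back, and stores it under another. The file names, the scratch folder, and the sample text are illustrative assumptions:

```python
import os
import tempfile

# Hypothetical scratch folder and file names
folder = tempfile.mkdtemp()
src = os.path.join(folder, 'latin1.txt')
dst = os.path.join(folder, 'utf8.txt')

text = 'A\xc4B\xe8C'                     # Sample non-ASCII text

with open(src, 'w', encoding='latin-1') as f:
    f.write(text)                        # Encoded to Latin-1 on output

with open(src, 'r', encoding='latin-1') as f:
    loaded = f.read()                    # Decoded back to str on input

with open(dst, 'w', encoding='utf-8') as f:
    f.write(loaded)                      # Same text, stored as UTF-8

with open(src, 'rb') as f:               # Binary reads show the stored bytes
    latin_bytes = f.read()
with open(dst, 'rb') as f:
    utf8_bytes = f.read()

print(latin_bytes, utf8_bytes)
```

The str object is the same in both cases; only the bytes stored in each file differ.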
Reading and Writing Unicode in 3.0
In fact, we can convert a string to different encodings both manually with method calls and automatically on file input and output. We’ll use the following Unicode string in this section to demonstrate:
C:\misc> c:\python30\python
>>> S = 'A\xc4B\xe8C' # 5-character string, non-ASCII
>>> S
'AÄBèC'
>>> len(S)
5
Manual encoding
As we’ve already learned, we can always encode such a string to raw bytes according to the target encoding name:
# Encode manually with methods
>>> L = S.encode('latin-1') # 5 bytes when encoded as latin-1
>>> L
b'A\xc4B\xe8C'
>>> len(L)
5
>>> U = S.encode('utf-8') # 7 bytes when encoded as utf-8
>>> U
b'A\xc3\x84B\xc3\xa8C'
>>> len(U)
7
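Going the other way, decoding with the same encoding name recovers the original string, while decoding with a different one either fails or silently yields the wrong characters. A minimal sketch reusing the names S, L, and U from the session above (the result names are illustrative assumptions):

```python
S = 'A\xc4B\xe8C'
L = S.encode('latin-1')                  # 5 bytes
U = S.encode('utf-8')                    # 7 bytes

# Decoding with the matching encoding recovers the original text
same_from_latin1 = L.decode('latin-1') == S
same_from_utf8 = U.decode('utf-8') == S

# Latin-1 maps every byte to a character, so a mismatch goes unreported:
# the 7 UTF-8 bytes come back as 7 characters instead of the original 5
mismatched = U.decode('latin-1')

# UTF-8 is stricter: the Latin-1 bytes are not a valid UTF-8 sequence
try:
    L.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(same_from_latin1, same_from_utf8, len(mismatched), strict_ok)
```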
File output encoding
Now, to write our string