
Learning Python - Mark Lutz [498]

to a text file in a particular encoding, we can simply pass the desired encoding name to open—although we could manually encode first and write in binary mode, there’s no need to:

# Encoding automatically when written

>>> open('latindata', 'w', encoding='latin-1').write(S) # Write as latin-1

5

>>> open('utf8data', 'w', encoding='utf-8').write(S) # Write as utf-8

5

>>> open('latindata', 'rb').read() # Read raw bytes

b'A\xc4B\xe8C'

>>> open('utf8data', 'rb').read() # Different in files

b'A\xc3\x84B\xc3\xa8C'
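As a rough sketch of the equivalence mentioned above, the following script writes the same string once in text mode with an encoding name and once in binary mode after encoding by hand, then compares the bytes on disk. The value of S is assumed to be the five-character string 'AÄBèC' implied by the output shown here, and the temporary file names are hypothetical.

```python
import os
import tempfile

S = 'A\xc4B\xe8C'   # assumed value of S: the 5-character string 'AÄBèC'

with tempfile.TemporaryDirectory() as tmp:
    auto_path = os.path.join(tmp, 'latindata_auto')      # hypothetical names
    manual_path = os.path.join(tmp, 'latindata_manual')

    # Text mode: open() encodes the string for us on write
    with open(auto_path, 'w', encoding='latin-1') as f:
        f.write(S)

    # Binary mode: we encode by hand and write the resulting bytes
    with open(manual_path, 'wb') as f:
        f.write(S.encode('latin-1'))

    with open(auto_path, 'rb') as f:
        auto_bytes = f.read()
    with open(manual_path, 'rb') as f:
        manual_bytes = f.read()

print(auto_bytes == manual_bytes)   # both routes store the same raw bytes
```

Passing the encoding name to open is shorter and also covers later writes to the same file object, which is why manual encoding is rarely worth the extra step.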

File input decoding

Similarly, to read arbitrary Unicode data, we simply pass in the file’s encoding type name to open, and it decodes from raw bytes to strings automatically; we could read raw bytes and decode manually too, but that can be tricky when reading in blocks (we might read an incomplete character), and it isn’t necessary:

# Decoding automatically when read

>>> open('latindata', 'r', encoding='latin-1').read() # Decoded on input

'AÄBèC'

>>> open('utf8data', 'r', encoding='utf-8').read() # Per encoding type

'AÄBèC'

>>> X = open('latindata', 'rb').read() # Manual decoding:

>>> X.decode('latin-1') # Not necessary

'AÄBèC'

>>> X = open('utf8data', 'rb').read()

>>> X.decode() # UTF-8 is default

'AÄBèC'
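The block-reading pitfall mentioned above can be sketched as follows. The split point is chosen by hand to fall inside a multibyte character, and the standard library's incremental decoder is shown as one possible way to cope with it; it is an illustration under those assumptions, not the only approach.

```python
import codecs

raw = 'A\xc4B\xe8C'.encode('utf-8')    # b'A\xc3\x84B\xc3\xa8C', 7 bytes

# Splitting after 2 bytes cuts the two-byte sequence for 'Ä' in half
first, rest = raw[:2], raw[2:]

try:
    first.decode('utf-8')              # decoding the partial block on its own
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False                   # incomplete character at end of block

# An incremental decoder buffers the incomplete sequence between calls
decoder = codecs.getincrementaldecoder('utf-8')()
text = decoder.decode(first) + decoder.decode(rest, final=True)

print(naive_ok, text == 'A\xc4B\xe8C')
```

Text-mode files do this buffering internally, which is why passing the encoding to open sidesteps the problem.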

Decoding mismatches

Finally, keep in mind that this behavior of files in 3.0 limits the kind of content you can load as text. As suggested in the prior section, Python 3.0 really must be able to decode the data in text files into a str string, according to either the default or a passed-in Unicode encoding name. Trying to open a truly binary data file in text mode, for example, is unlikely to work in 3.0 even if you use the correct object types:

>>> file = open('python.exe', 'r')

>>> text = file.read()

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2: ...

>>> file = open('python.exe', 'rb')

>>> data = file.read()

>>> data[:20]

b'MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff\x00\x00\xb8\x00\x00\x00'
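The same kind of failure can be reproduced without an actual executable. This minimal sketch decodes the first few header bytes shown above; it assumes UTF-8 rather than the platform "charmap" codec from the session, so that it behaves the same on any machine.

```python
# First bytes of the header shown above; not valid UTF-8 text
data = b'MZ\x90\x00\x03\x00\x00\x00'

try:
    data.decode('utf-8')           # what a text-mode read would attempt
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    bad_offset = exc.start         # index of the first undecodable byte

# Kept as bytes, the content is intact and can be sliced or compared
header = data[:2]

print(failed, bad_offset, header)
```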

The first of these examples, the text-mode read of python.exe, might not fail in Python 2.X (normal files do not decode text), even though it probably should: reading the file may return corrupted data in the string, due to automatic end-of-line translations in text mode (any embedded \r\n bytes will be translated to \n on Windows when read). To treat file content as Unicode text in 2.6, we need to use special tools instead of the general open built-in function, as we’ll see in a moment. First, though, let’s turn to a more explosive topic....

Handling the BOM in 3.0

As described earlier in this chapter, some encoding schemes store a special byte order marker (BOM) sequence at the start of files, to specify data endianness or declare the encoding type. Python both skips this marker on input and writes it on output if the encoding name implies it, but we sometimes must use a specific encoding name to force BOM processing explicitly.
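A minimal sketch of this behavior, using the standard "utf-8-sig" codec name and a hypothetical temporary file called bomdata.txt: the BOM is written when the encoding name implies it, skipped when the same name is used to read, and left in the text as a '\ufeff' character when the plain "utf-8" name is used instead.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bomdata.txt')     # hypothetical file name

    # 'utf-8-sig' writes the three-byte UTF-8 BOM before the text
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write('spam')

    with open(path, 'rb') as f:
        raw = f.read()                          # BOM bytes plus b'spam'

    # Reading with the same name strips the BOM
    with open(path, 'r', encoding='utf-8-sig') as f:
        skipped = f.read()

    # Reading as plain 'utf-8' keeps the BOM as a leading character
    with open(path, 'r', encoding='utf-8') as f:
        kept = f.read()

print(raw.startswith(codecs.BOM_UTF8), repr(skipped), repr(kept))
```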

For example, when you save a text file in Windows Notepad, you can specify its encoding type in a drop-down list: simple ASCII text, UTF-8, or little- or big-endian UTF-16. If a two-line text file named spam.txt is saved in Notepad as the encoding type “ANSI,” for instance, it’s written as simple ASCII text without a BOM. When this file is read in binary mode in Python, we can see the actual bytes stored in the file. When it’s read as text, Python performs end-of-line translation by default; we can also decode it as explicit UTF-8 text, since ASCII is a subset of this scheme (and UTF-8 is the default reported by sys.getdefaultencoding in Python 3.0):

c:\misc> C:\Python30\python # File saved in Notepad

>>> import sys

>>> sys.getdefaultencoding()

'utf-8'

>>> open('spam.txt', 'rb').read() # ASCII (UTF-8) text file

b'spam\r\nSPAM\r\n'

>>> open('spam.txt', 'r').read() # Text mode translates line-end

'spam\nSPAM\n'

>>> open('spam.txt', 'r', encoding='utf-8').read()

'spam\nSPAM\n'
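The session above depends on a file saved from Notepad. As a hedged stand-in, this sketch writes the same bytes directly to a temporary file and then reads them back three ways. The newline='' argument is an addition not used in the session; it is included only to show that the line-end translation can be switched off.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'spam.txt')

    # Stand-in for the Notepad save: ASCII text with Windows line ends
    with open(path, 'wb') as f:
        f.write(b'spam\r\nSPAM\r\n')

    with open(path, 'rb') as f:
        raw = f.read()                    # bytes exactly as stored

    # Text mode maps \r\n to \n on input by default
    with open(path, 'r', encoding='utf-8') as f:
        translated = f.read()

    # newline='' leaves line ends untranslated
    with open(path, 'r', encoding='utf-8', newline='') as f:
        untranslated = f.read()

print(repr(raw), repr(translated), repr(untranslated))
```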

If this file is instead saved as “UTF-8” in Notepad, it is prepended with a three-byte UTF-8 BOM sequence, and we need to give a more specific encoding name (“utf-8-sig”) to force Python
