Online Book Reader

Learning Python - Mark Lutz [499]

to skip the marker:

>>> open('spam.txt', 'rb').read() # UTF-8 with 3-byte BOM

b'\xef\xbb\xbfspam\r\nSPAM\r\n'

>>> open('spam.txt', 'r').read()

'ï»¿spam\nSPAM\n'

>>> open('spam.txt', 'r', encoding='utf-8').read()

'\ufeffspam\nSPAM\n'

>>> open('spam.txt', 'r', encoding='utf-8-sig').read()

'spam\nSPAM\n'
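
The same behavior can be reproduced without a file by decoding the raw bytes directly. This is a minimal sketch rather than part of the original session: the variable names are illustrative, and the `cp1252` step assumes that the platform default encoding behind the plain `'r'` read above was Windows code page 1252.

```python
import codecs

# Bytes as an editor would save them in UTF-8 with a signature:
# the 3-byte BOM followed by the text itself.
raw = codecs.BOM_UTF8 + b'spam\nSPAM\n'

kept = raw.decode('utf-8')          # BOM survives as the character '\ufeff'
skipped = raw.decode('utf-8-sig')   # BOM is recognized and dropped
garbled = raw.decode('cp1252')      # assumed default: BOM bytes become 3 odd characters

print(repr(kept))
print(repr(skipped))
print(repr(garbled))
```

Only `utf-8-sig` yields text with no trace of the marker.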

If the file is stored as “Unicode big endian” in Notepad, we get UTF-16-format data in the file, prepended with a two-byte BOM sequence. The encoding name “utf-16” in Python skips the BOM because it is implied (this codec expects a BOM at the start of a file and uses it to determine byte order), while “utf-16-be” handles the big-endian format but does not skip the BOM:

>>> open('spam.txt', 'rb').read()

b'\xfe\xff\x00s\x00p\x00a\x00m\x00\r\x00\n\x00S\x00P\x00A\x00M\x00\r\x00\n'

>>> open('spam.txt', 'r').read()

UnicodeEncodeError: 'charmap' codec can't encode character '\xfe' in position 1:...

>>> open('spam.txt', 'r', encoding='utf-16').read()

'spam\nSPAM\n'

>>> open('spam.txt', 'r', encoding='utf-16-be').read()

'\ufeffspam\nSPAM\n'
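
As a rough illustration of the difference between the two names, the following sketch decodes big-endian bytes in memory instead of reading a Notepad file. The variable names are hypothetical, and the byte string is built by hand so that the example does not depend on any particular editor.

```python
import codecs

# Big-endian UTF-16 data with its 2-byte BOM in front.
raw = codecs.BOM_UTF16_BE + 'spam\n'.encode('utf-16-be')

generic = raw.decode('utf-16')      # reads the BOM to pick the byte order, then drops it
specific = raw.decode('utf-16-be')  # byte order is fixed, so the BOM is kept as data

print(repr(generic))
print(repr(specific))
```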

The same is generally true for output. When writing a Unicode file in Python code, we need a more explicit encoding name to force the BOM in UTF-8—“utf-8” does not write (or skip) the BOM, but “utf-8-sig” does:

>>> open('temp.txt', 'w', encoding='utf-8').write('spam\nSPAM\n')

>>> open('temp.txt', 'rb').read() # No BOM

b'spam\r\nSPAM\r\n'

>>> open('temp.txt', 'w', encoding='utf-8-sig').write('spam\nSPAM\n')

>>> open('temp.txt', 'rb').read() # Wrote BOM

b'\xef\xbb\xbfspam\r\nSPAM\r\n'

>>> open('temp.txt', 'r').read()

'ï»¿spam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-8').read() # Keeps BOM

'\ufeffspam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-8-sig').read() # Skips BOM

'spam\nSPAM\n'
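
The output side can be checked in a platform-neutral way with a small helper. This is a sketch under stated assumptions: `written_bytes` is a hypothetical function, not something from the text, and it passes `newline=''` so that no `\r\n` translation occurs and the bytes look the same on Windows and Unix.

```python
import os
import tempfile

def written_bytes(encoding, text='spam\n'):
    """Write text with the given encoding and return the file's raw bytes."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    try:
        # newline='' disables line-end translation on output
        with open(path, 'w', encoding=encoding, newline='') as f:
            f.write(text)
        with open(path, 'rb') as f:
            return f.read()
    finally:
        os.remove(path)

plain = written_bytes('utf-8')        # no BOM written
signed = written_bytes('utf-8-sig')   # BOM written first

print(plain)
print(signed)
```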

Notice that although “utf-8” does not drop the BOM, data without a BOM can be read with both “utf-8” and “utf-8-sig”—use the latter for input if you’re not sure whether a BOM is present in a file (and don’t read this paragraph out loud in an airport security line!):

>>> open('temp.txt', 'w').write('spam\nSPAM\n')

>>> open('temp.txt', 'rb').read() # Data without BOM

b'spam\r\nSPAM\r\n'

>>> open('temp.txt', 'r').read() # Any utf-8 works

'spam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-8').read()

'spam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-8-sig').read()

'spam\nSPAM\n'
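
One way to apply this advice is a reader that always uses “utf-8-sig” on input. The sketch below is illustrative only: `read_utf8_text` and `demo` are hypothetical names, and the two test payloads are made up to cover the with-BOM and without-BOM cases.

```python
import os
import tempfile

def read_utf8_text(path):
    """Read UTF-8 text, dropping a leading BOM if one is present."""
    with open(path, 'r', encoding='utf-8-sig') as f:
        return f.read()

def demo():
    """Read one file without a BOM and one with a BOM through the same helper."""
    results = []
    for payload in (b'spam\n', b'\xef\xbb\xbfspam\n'):
        fd, path = tempfile.mkstemp(suffix='.txt')
        try:
            with os.fdopen(fd, 'wb') as f:
                f.write(payload)
            results.append(read_utf8_text(path))
        finally:
            os.remove(path)
    return results

print(demo())
```

Both files come back as the same string, so the caller does not need to know in advance whether a marker is present.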

Finally, for the encoding name “utf-16,” the BOM is handled automatically: on output, data is written in the platform’s native endianness, and the BOM is always written; on input, data is decoded per the BOM, and the BOM is always stripped. More specific UTF-16 encoding names can specify different endianness, though you may have to manually write and skip the BOM yourself in some scenarios if it is required or present:

>>> import sys

>>> sys.byteorder

'little'

>>> open('temp.txt', 'w', encoding='utf-16').write('spam\nSPAM\n')

>>> open('temp.txt', 'rb').read()

b'\xff\xfes\x00p\x00a\x00m\x00\r\x00\n\x00S\x00P\x00A\x00M\x00\r\x00\n\x00'

>>> open('temp.txt', 'r', encoding='utf-16').read()

'spam\nSPAM\n'

>>> open('temp.txt', 'w', encoding='utf-16-be').write('\ufeffspam\nSPAM\n')

>>> open('temp.txt', 'rb').read()

b'\xfe\xff\x00s\x00p\x00a\x00m\x00\r\x00\n\x00S\x00P\x00A\x00M\x00\r\x00\n'

>>> open('temp.txt', 'r', encoding='utf-16').read()

'spam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-16-be').read()

'\ufeffspam\nSPAM\n'
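
The output rules just described can be sketched in memory with string encoding calls. The names below are illustrative, and prepending `codecs.BOM_UTF16_BE` by hand is one possible way to add the marker, not the only one.

```python
import codecs
import sys

# "utf-16" on output: a BOM in the platform's native byte order comes first.
generic = 'spam'.encode('utf-16')
native_bom = codecs.BOM_UTF16_LE if sys.byteorder == 'little' else codecs.BOM_UTF16_BE

# "utf-16-be" on output: fixed byte order and no BOM, so add one by hand if needed.
bare = 'spam'.encode('utf-16-be')
marked = codecs.BOM_UTF16_BE + bare

print(generic.startswith(native_bom))
print(bare)
print(repr(marked.decode('utf-16')))
print(repr(marked.decode('utf-16-be')))
```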

The more specific UTF-16 encoding names work fine with BOM-less files, though “utf-16” requires one on input in order to determine byte order:

>>> open('temp.txt', 'w', encoding='utf-16-le').write('SPAM')

>>> open('temp.txt', 'rb').read() # OK if BOM not present or expected

b'S\x00P\x00A\x00M\x00'

>>> open('temp.txt', 'r', encoding='utf-16-le').read()

'SPAM'

>>> open('temp.txt', 'r', encoding='utf-16').read()

UnicodeError: UTF-16 stream does not start with BOM
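
When it is unknown whether UTF-16 data carries a BOM, one option is to inspect the first two bytes before choosing a codec. The helper below is a hypothetical sketch: `decode_utf16` is not a library function, and falling back to little-endian order is an assumption that the caller must confirm for their own data.

```python
import codecs

def decode_utf16(raw, default='utf-16-le'):
    """Decode UTF-16 bytes, using the BOM if present, else a caller-chosen byte order."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # A BOM is present: let the generic codec read and strip it.
        return raw.decode('utf-16')
    # No BOM: the byte order cannot be detected, so use the assumed default.
    return raw.decode(default)

print(decode_utf16(b'S\x00P\x00A\x00M\x00'))            # BOM-less, little-endian
print(decode_utf16(b'\xfe\xff\x00S\x00P\x00A\x00M'))    # big-endian with BOM
```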

Experiment with these encodings yourself or see Python’s library manuals for more details on the BOM.

Unicode Files in 2.6

The preceding discussion applies to Python 3.0’s string types and files. You can achieve
