Learning Python - Mark Lutz [492]
>>> u'ab' + 'cd' # Can mix if compatible in 2.6
u'abcd' # 'ab' + b'cd' not allowed in 3.0
In fact, the difference in types is often trivial to your code in 2.6. Like normal strings, Unicode strings may be concatenated, indexed, sliced, matched with the re module, and so on, and they cannot be changed in-place. If you ever need to convert between the two types explicitly, you can use the built-in str and unicode functions:
>>> str(u'spam') # Unicode to normal
'spam'
>>> unicode('spam') # Normal to Unicode
u'spam'
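For comparison, here is a minimal sketch of the analogous conversions in 3.0, where the two string types are str and bytes and the conversion calls are the encode and decode methods (the variable names are illustrative only):

```python
# 3.0 analogue of the 2.6 str/unicode conversions: str is always
# decoded Unicode text, and bytes holds encoded 8-bit data.
text = 'spam'                  # Unicode text string
data = text.encode('ascii')   # text to bytes: like 2.6's str(u'spam')
back = data.decode('ascii')   # bytes to text: like 2.6's unicode('spam')
print(data, back)
```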
However, this liberal approach to mixing string types in 2.6 only works if the string is compatible with the unicode object’s encoding type:
>>> S = 'A\xC4B\xE8C' # Can't mix if incompatible
>>> U = u'A\xC4B\xE8C'
>>> S + U
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)
>>> S.decode('latin-1') + U # Manual conversion still required
u'A\xc4B\xe8CA\xc4B\xe8C'
>>> print S.decode('latin-1') + U
AÄBèCAÄBèC
Finally, as we’ll see in more detail later in this chapter, 2.6’s open call supports only files of 8-bit bytes, returning their contents as str strings; it’s up to you to interpret the contents as text or binary data and decode if needed. To read and write Unicode files and encode or decode their content automatically, use 2.6’s codecs.open call, documented in the 2.6 library manual. This call provides much the same functionality as 3.0’s open and uses 2.6 unicode objects to represent file content—reading a file translates encoded bytes into decoded Unicode characters, and writing translates strings to the desired encoding specified when the file is opened.
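A minimal sketch of writing and reading a Latin-1 text file through codecs.open (the filename here is illustrative; in 2.6 the decoded content comes back as a unicode object, and the same call also exists in 3.X):

```python
import codecs

# Write text through an encoding layer: characters are encoded
# to Latin-1 bytes automatically as they are written.
f = codecs.open('data.txt', 'w', encoding='latin-1')
f.write(u'A\xC4B\xE8C')          # AÄBèC
f.close()

# Read it back: Latin-1 bytes are decoded to characters on input.
f = codecs.open('data.txt', 'r', encoding='latin-1')
text = f.read()
f.close()
print(text)                      # AÄBèC
```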
Source File Character Set Encoding Declarations
Unicode escape codes are fine for the occasional Unicode character in string literals, but they can become tedious if you need to embed non-ASCII text in your strings frequently. For strings you code within your script files, Python uses the UTF-8 encoding by default, but it allows you to change this to support arbitrary character sets by including a comment that names your desired encoding. The comment must be of this form and must appear as either the first or second line in your script in either Python 2.6 or 3.0:
# -*- coding: latin-1 -*-
When a comment of this form is present, Python will recognize strings represented natively in the given encoding. This means you can edit your script file in a text editor that accepts and displays accented and other non-ASCII characters correctly, and Python will decode them correctly in your string literals. For example, notice how the comment at the top of the following file, text.py, allows Latin-1 characters to be embedded in strings:
# -*- coding: latin-1 -*-
# Any of the following string literal forms work in latin-1.
# Changing the encoding above to either ascii or utf-8 fails,
# because the 0xc4 and 0xe8 in myStr1 are not valid in either.
myStr1 = 'aÄBèC'
myStr2 = 'A\u00c4B\U000000e8C'
myStr3 = 'A' + chr(0xC4) + 'B' + chr(0xE8) + 'C'
import sys
print('Default encoding:', sys.getdefaultencoding())
for aStr in myStr1, myStr2, myStr3:
    print('{0}, strlen={1}, '.format(aStr, len(aStr)), end='')

    bytes1 = aStr.encode()           # Per default utf-8: 2 bytes for non-ASCII
    bytes2 = aStr.encode('latin-1')  # One byte per char
    #bytes3 = aStr.encode('ascii')   # ASCII fails: outside 0..127 range

    print('byteslen1={0}, byteslen2={1}'.format(len(bytes1), len(bytes2)))
When run, this script produces the following output:
C:\misc> c:\python30\python text.py
Default encoding: utf-8
aÄBèC, strlen=5, byteslen1=7, byteslen2=5
AÄBèC, strlen=5, byteslen1=7, byteslen2=5
AÄBèC, strlen=5, byteslen1=7, byteslen2=5
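The byte counts follow directly from the encodings: each of the two non-ASCII characters in these strings costs two bytes in UTF-8 but just one in Latin-1, which you can verify character by character:

```python
# Why byteslen1=7 but byteslen2=5: UTF-8 spends 2 bytes on each of
# the two non-ASCII characters, while Latin-1 spends 1 on every one.
for ch in 'A\xC4B\xE8C':                        # AÄBèC
    print(ch, len(ch.encode('utf-8')), len(ch.encode('latin-1')))
```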
Since most programmers are likely to fall back on the standard UTF-8 encoding, I’ll defer to Python’s standard manual set for more details on this option and other advanced Unicode support topics, such as properties and character name escapes in