Learning Python - Mark Lutz [491]
>>> T
b'\xc1c\xc2T\xc3'
>>> U = T.decode('cp500') # Convert back to Unicode
>>> U
'AÄBèC'
>>> U.encode() # Default utf-8 encoding again
b'A\xc3\x84B\xc3\xa8C'
Keep in mind that the special Unicode and hex character escapes are only necessary when you code non-ASCII Unicode strings manually. In practice, you’ll often load such text from files instead. As we’ll see later in this chapter, 3.0’s file object (created with the open built-in function) automatically decodes text strings as they are read and encodes them when they are written; because of this, your script can often deal with strings generically, without having to code special characters directly.
Later in this chapter we’ll also see that it’s possible to convert between encodings when transferring strings to and from files, using a technique very similar to that in the last example. Although you’ll still need to provide explicit encoding names when opening a file, the file interface does most of the conversion work for you automatically.
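As a brief preview of that file technique, here is a minimal sketch for Python 3.x. The file names and the temporary directory are hypothetical, made up only for this illustration; the point is simply that the encoding name passed to open decides how a string is translated to and from bytes.

```python
import os
import tempfile

text = 'A\xc4B\xe8C'                    # 'AÄBèC', coded with hex escapes

# Hypothetical scratch files in a temporary directory (illustration only)
with tempfile.TemporaryDirectory() as tmpdir:
    latin_path = os.path.join(tmpdir, 'latin1.txt')
    utf8_path = os.path.join(tmpdir, 'utf8.txt')

    # The str is encoded to latin-1 bytes as it is written
    with open(latin_path, 'w', encoding='latin-1') as f:
        f.write(text)

    # The bytes are decoded back to a str as the file is read
    with open(latin_path, 'r', encoding='latin-1') as f:
        loaded = f.read()

    # Writing the same str under another encoding name converts it
    with open(utf8_path, 'w', encoding='utf-8') as f:
        f.write(loaded)

    # Binary mode skips decoding and shows the raw bytes stored in each file
    with open(latin_path, 'rb') as f:
        latin_bytes = f.read()
    with open(utf8_path, 'rb') as f:
        utf8_bytes = f.read()

print(loaded == text)                   # Same characters after the round trip
print(latin_bytes)                      # One byte per character
print(utf8_bytes)                       # Two bytes for each non-ASCII character
```

The script itself never handles the non-ASCII bytes directly; it only names the encodings and lets the file objects do the translation.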
Coding Unicode Strings in Python 2.6
Now that I’ve shown you the basics of Unicode strings in 3.0, I need to explain that you can do much the same in 2.6, though the tools differ. unicode is available in Python 2.6, but it is a distinct data type from str, and it allows free mixing of normal and Unicode strings when they are compatible. In fact, you can essentially pretend 2.6’s str is 3.0’s bytes when it comes to decoding raw bytes into a Unicode string, as long as the bytes are in the form the named encoding expects. Here is 2.6 in action; non-ASCII Unicode characters display as hex escapes in 2.6 unless you explicitly print, and non-ASCII displays can vary per shell (most of this section ran in IDLE):
C:\misc> c:\python26\python
>>> import sys
>>> sys.version
'2.6 (r26:66721, Oct 2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)]'
>>> S = 'A\xC4B\xE8C' # String of 8-bit bytes
>>> print S # Some are non-ASCII
AÄBèC
>>> S.decode('latin-1') # Decode bytes per latin-1 to Unicode
u'A\xc4B\xe8C'
>>> S.decode('utf-8') # Not formatted as utf-8
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid data
>>> S.decode('ascii') # Outside ASCII range
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)
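For comparison, the same decoding steps can be sketched in Python 3.x, where bytes plays the role of 2.6's str. This is an illustration rather than a transcript of a real session; the names raw, decoded, and failures are made up for the example, and the exact error wording may differ from the 2.6 messages shown above.

```python
raw = b'A\xc4B\xe8C'                    # 8-bit bytes, like 2.6's str above

decoded = raw.decode('latin-1')         # latin-1 maps every byte to one character

# Record where decoding fails under encodings the data does not conform to
failures = {}
for name in ('utf-8', 'ascii'):
    try:
        raw.decode(name)
    except UnicodeDecodeError as exc:
        failures[name] = (exc.start, exc.reason)   # Offset of the bad byte, and why

print(decoded)
for name, (start, reason) in failures.items():
    print(name, 'failed at byte offset', start, ':', reason)
```

Both failures are reported at offset 1, the 0xc4 byte, which is neither an ASCII character nor the start of a valid UTF-8 sequence here.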
To store Unicode text itself, rather than bytes in some particular encoding, make a unicode object with the u'xxx' literal form (this literal is no longer available in 3.0, since all strings support Unicode in 3.0):
>>> U = u'A\xC4B\xE8C' # Make Unicode string, hex escapes
>>> U
u'A\xc4B\xe8C'
>>> print U
AÄBèC
Once you’ve created it, you can convert Unicode text to different raw byte encodings, similar to encoding str objects into bytes objects in 3.0:
>>> U.encode('latin-1') # Encode per latin-1: 8-bit bytes
'A\xc4B\xe8C'
>>> U.encode('utf-8') # Encode per utf-8: multibyte
'A\xc3\x84B\xc3\xa8C'
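The difference between the two encodings is easiest to see by counting. The following is a small Python 3.x sketch (str there corresponds to 2.6's unicode); the variable names are illustrative only.

```python
text = 'A\xc4B\xe8C'                    # Five characters, two of them non-ASCII

latin = text.encode('latin-1')          # One byte per character
utf8 = text.encode('utf-8')             # Two bytes for each non-ASCII character

print(len(text), len(latin), len(utf8))

# Decoding with the matching encoding name recovers the original text
print(latin.decode('latin-1') == text, utf8.decode('utf-8') == text)
```

The character count stays at five in both cases; only the number of bytes used to represent those characters changes with the encoding.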
Non-ASCII characters can be coded with hex or Unicode escapes in string literals in 2.6, just as in 3.0. However, as with bytes in 3.0, the "\u..." and "\U..." escapes are recognized only for unicode strings in 2.6, not 8-bit str strings:
C:\misc> c:\python26\python
>>> U = u'A\xC4B\xE8C' # Hex escapes for non-ASCII
>>> U
u'A\xc4B\xe8C'
>>> print U
AÄBèC
>>> U = u'A\u00C4B\U000000E8C' # Unicode escapes for non-ASCII
>>> U # \uXXXX = 16 bits, \UXXXXXXXX = 32 bits
u'A\xc4B\xe8C'
>>> print U
AÄBèC
>>> S = 'A\xC4B\xE8C' # Hex escapes work
>>> S
'A\xc4B\xe8C'
>>> print S # But some print oddly, unless decoded
A-BFC
>>> print S.decode('latin-1')
AÄBèC
>>> S = 'A\u00C4B\U000000E8C' # Not Unicode escapes: taken literally!
>>> S
'A\\u00C4B\\U000000E8C'
>>> print S
A\u00C4B\U000000E8C
>>> len(S)
19
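The length of 19 follows from every backslash sequence being kept as literal characters. The Python 3.x sketch below reproduces the count with a bytes object whose backslashes are doubled so that they stay literal, and then uses the standard unicode_escape codec to interpret them after the fact. The variable names are made up for illustration, and this is one possible way to convert such text, not the approach taken in this chapter.

```python
text = 'A\u00C4B\U000000E8C'            # str literal: escapes become characters
raw = b'A\\u00C4B\\U000000E8C'          # bytes: doubled backslashes stay literal

print(len(text))                        # Characters after escape processing
print(len(raw))                         # 1 + 6 + 1 + 10 + 1 literal bytes

# The unicode_escape codec interprets the \u and \U sequences while decoding
converted = raw.decode('unicode_escape')
print(converted == text)
```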
Like 3.0’s str and bytes, 2.6’s unicode and str share nearly identical operation sets, so unless you need to convert to other encodings you can often treat unicode as though it were str. One of the primary differences between 2.6 and 3.0, though, is that unicode and non-Unicode str objects can be freely mixed in expressions, and as long as the str is compatible with the unicode’s encoding Python