Online Book Reader


Learning Python - Mark Lutz [489]

string, not its str converted form (this is usually not what you’ll want!). Assuming B and S are still as in the prior listing:

>>> import sys

>>> sys.platform # Underlying platform

'win32'

>>> sys.getdefaultencoding() # Default encoding for str here

'utf-8'

>>> bytes(S)

TypeError: string argument without an encoding

>>> str(B) # str without encoding

"b'spam'" # A print string, not conversion!

>>> len(str(B))

7

>>> len(str(B, encoding='ascii')) # Use encoding to convert to str

4
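The conversions in this listing can be collected into one small sketch. It assumes B is b'spam' (which matches what str(B) printed above) and that S is the text 'spam'; the value of S comes from a listing not shown here, so treat it as an assumption.

```python
# Assumed values: B matches the output of str(B) above; S is a guess at
# the value used in the prior listing.
B = b'spam'
S = 'spam'

# str -> bytes: an encoding name is required
b1 = bytes(S, encoding='ascii')
b2 = S.encode('ascii')

# bytes -> str: pass an encoding to str(), or call decode()
s1 = str(B, encoding='ascii')
s2 = B.decode('ascii')

print(b1, b2)          # b'spam' b'spam'
print(s1, s2)          # spam spam

# Without an encoding, str() returns the print string of the bytes object
print(repr(str(B)))    # "b'spam'"
```

Both spellings in each direction give the same result here; which one reads better is largely a matter of taste.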

Coding Unicode Strings

Encoding and decoding become more meaningful when you start dealing with actual non-ASCII Unicode text. To code arbitrary Unicode characters in your strings, some of which you might not even be able to type on your keyboard, Python string literals support both "\xNN" hex byte value escapes and "\uNNNN" and "\UNNNNNNNN" Unicode escapes. In Unicode escapes, the first form gives four hex digits to encode a 2-byte (16-bit) character code, and the second gives eight hex digits for a 4-byte (32-bit) code.
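As a quick sketch of the three escape forms, the following names the same character three ways and then uses the eight-digit form for a code point too large for "\uNNNN". The variable names are illustrative only, and the length check assumes Python 3.3 or later, where str always counts whole code points; older narrow builds may report 2 for the last string.

```python
# The same character, U+00C4, written with each escape form
hex_form = '\xc4'           # two hex digits
short_form = '\u00c4'       # four hex digits
long_form = '\U000000c4'    # eight hex digits
same = (hex_form == short_form == long_form)

# A code point above 0xFFFF can only be named with the eight-digit form
wide = '\U0001F40D'

print(same)              # True
print(hex(ord(wide)))    # 0x1f40d
print(len(wide))         # 1 on Python 3.3+ (assumption noted above)
```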

Coding ASCII Text

Let’s step through some examples that demonstrate text coding basics. As we’ve seen, ASCII text is a simple type of Unicode, stored as a sequence of byte values that represent characters:

C:\misc> c:\python30\python

>>> ord('X') # 'X' has binary value 88 in the default encoding

88

>>> chr(88) # 88 stands for character 'X'

'X'

>>> S = 'XYZ' # A Unicode string of ASCII text

>>> S

'XYZ'

>>> len(S) # 3 characters long

3

>>> [ord(c) for c in S] # 3 characters with integer ordinal values

[88, 89, 90]

Normal 7-bit ASCII text like this is represented with one character per byte under each of the Unicode encoding schemes described earlier in this chapter:

>>> S.encode('ascii') # Values 0..127 in 1 byte (7 bits) each

b'XYZ'

>>> S.encode('latin-1') # Values 0..255 in 1 byte (8 bits) each

b'XYZ'

>>> S.encode('utf-8') # Values 0..127 in 1 byte, 128..2047 in 2, others 3 or 4

b'XYZ'

In fact, the bytes object returned by encoding ASCII text this way is really a sequence of short integers, which just happen to print as ASCII characters when possible:

>>> S.encode('latin-1')[0]

88

>>> list(S.encode('latin-1'))

[88, 89, 90]
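A small sketch of what "sequence of short integers" means in practice: indexing a bytes object yields an int, slicing yields another bytes object, and a list of integers in the range 0..255 can be turned back into bytes. The names used here are illustrative only.

```python
B = 'XYZ'.encode('latin-1')

first = B[0]                     # indexing gives an integer
piece = B[0:1]                   # slicing gives a bytes object
rebuilt = bytes([88, 89, 90])    # integers 0..255 back to bytes

print(first)      # 88
print(piece)      # b'X'
print(rebuilt)    # b'XYZ'
```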

Coding Non-ASCII Text

To code non-ASCII characters, you may use hex or Unicode escapes in your strings; hex escapes are limited to a single byte’s value, but Unicode escapes can name characters with values two and four bytes wide. The hex values 0xC4 and 0xE8, for instance, are codes for two special accented characters outside the 7-bit range of ASCII, but we can embed them in 3.0 str objects because str supports Unicode today:

>>> chr(0xc4) # 0xC4, 0xE8: characters outside ASCII's range

'Ä'

>>> chr(0xe8)

'è'

>>> S = '\xc4\xe8' # Single byte 8-bit hex escapes

>>> S

'Äè'

>>> S = '\u00c4\u00e8' # 16-bit Unicode escapes

>>> S

'Äè'

>>> len(S) # 2 characters long (not number of bytes!)

2
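To make the difference between characters and bytes concrete, here is a hedged sketch that takes four one-character strings and measures how many bytes each needs under UTF-8, following the ranges given in the earlier comment (1 byte for 0..127, 2 for 128..2047, 3 or 4 for others). The sample characters are arbitrary choices, not taken from the text.

```python
# Four single characters with increasingly large code points
samples = ['X', '\xc4', '\u20ac', '\U0001F600']

# Pair each code point (in hex) with its encoded size in UTF-8
widths = [(hex(ord(ch)), len(ch.encode('utf-8'))) for ch in samples]

for code, size in widths:
    print(code, size)
# 0x58 1
# 0xc4 2
# 0x20ac 3
# 0x1f600 4
```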

Encoding and Decoding Non-ASCII Text

Now, if we try to encode a non-ASCII string into raw bytes as ASCII, we’ll get an error. Encoding as Latin-1 works, though, and allocates one byte per character; encoding as UTF-8 allocates 2 bytes per character instead. If you write this string to a file, the raw bytes shown here are what is actually stored in the file for the encoding types given:

>>> S = '\u00c4\u00e8'

>>> S

'Äè'

>>> len(S)

2

>>> S.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1:

ordinal not in range(128)

>>> S.encode('latin-1') # One byte per character

b'\xc4\xe8'

>>> S.encode('utf-8') # Two bytes per character

b'\xc3\x84\xc3\xa8'

>>> len(S.encode('latin-1')) # 2 bytes in latin-1, 4 in utf-8

2

>>> len(S.encode('utf-8'))

4
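The claim about files can be checked with a short sketch: write the same string under each encoding, then read the files back in binary mode to see the stored bytes. The temporary directory and file names are hypothetical, chosen only so the example cleans up after itself.

```python
import os
import tempfile

S = '\u00c4\u00e8'
stored = {}

with tempfile.TemporaryDirectory() as tmp:
    for enc in ('latin-1', 'utf-8'):
        # Hypothetical file name, one file per encoding
        path = os.path.join(tmp, 'data-' + enc + '.txt')
        with open(path, 'w', encoding=enc) as f:
            f.write(S)                  # text mode encodes on output
        with open(path, 'rb') as f:
            stored[enc] = f.read()      # binary mode returns raw bytes

print(stored['latin-1'])    # b'\xc4\xe8'
print(stored['utf-8'])      # b'\xc3\x84\xc3\xa8'
```

The bytes read back match the results of S.encode() for the same encoding names, two bytes for Latin-1 and four for UTF-8.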

Note that you can also go the other way, reading raw bytes from a file and decoding them back to a Unicode string. However, as we’ll see later, the encoding mode you give to the open call causes this decoding to be done for you automatically on input (and avoids issues that may arise from reading partial character sequences
