Online Book Reader


Learning Python - Mark Lutz [503]

but to sample the flavor of this domain, suppose we have a simple XML text file, mybooks.xml:

<books>
    <date>2009</date>
    <title>Learning Python</title>
    <title>Programming Python</title>
    <title>Python Pocket Reference</title>
    <publisher>O'Reilly Media</publisher>
</books>

and we want to run a script to extract and display the content of all the nested title tags, as follows:

Learning Python

Programming Python

Python Pocket Reference

There are at least four basic ways to accomplish this (not counting more advanced tools like XPath). First, we could run basic pattern matching on the file’s text, though this tends to be inaccurate if the text is unpredictable. Where applicable, the re module we met earlier does the job: its match method looks for a match at the start of a string, search scans ahead for a match, and the findall method used here locates all places where the pattern matches in the string. The findall result comes back as a list of matched substrings corresponding to parenthesized pattern groups, or as a list of tuples of such substrings when the pattern has multiple groups:

# File patternparse.py
import re

text = open('mybooks.xml').read()
found = re.findall('<title>(.*)</title>', text)    # one string per title match
for title in found: print(title)
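To make the difference between match, search, and findall concrete, here is a minimal sketch that runs all three against an inline string. The inline text is only a stand-in for mybooks.xml, and the two-group pattern at the end is a hypothetical variation added to show the list-of-tuples result:

```python
import re

# Inline sample standing in for mybooks.xml (an assumption for this sketch)
text = """<books>
    <title>Learning Python</title>
    <title>Programming Python</title>
</books>"""

pattern = '<title>(.*)</title>'

at_start = re.match(pattern, text)    # None: the text starts with <books>, not <title>
first = re.search(pattern, text)      # scans ahead to the first match
titles = re.findall(pattern, text)    # one group: a list of strings

# Hypothetical two-group pattern: findall returns a list of tuples
pairs = re.findall('<(title)>(.*)</title>', text)

print(at_start)
print(first.group(1))
print(titles)
print(pairs)
```

Because "." does not match a newline by default, each match stays on its own line here. A file with several title tags on one line would need a non-greedy pattern such as (.*?), which is one reason pattern matching is fragile for XML.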

Second, to be more robust, we could perform complete XML parsing with the standard library’s DOM parsing support. DOM parses XML text into a tree of objects and provides an interface for navigating the tree to extract tag attributes and values; the interface is a formal specification, independent of Python:

# File domparse.py
from xml.dom.minidom import parse, Node

xmltree = parse('mybooks.xml')
for node1 in xmltree.getElementsByTagName('title'):
    for node2 in node1.childNodes:
        if node2.nodeType == Node.TEXT_NODE:
            print(node2.data)
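The DOM interface also exposes tag attributes, which the script above does not use. The following is a rough sketch of reading both an attribute and the text of each element. The edition attribute is hypothetical and does not appear in mybooks.xml, and the XML is parsed from an inline string with parseString rather than from a file:

```python
from xml.dom.minidom import parseString, Node

# Hypothetical sample: the 'edition' attribute is invented for illustration
xmltext = ('<books>'
           '<title edition="4">Learning Python</title>'
           '<title edition="3">Programming Python</title>'
           '</books>')

doc = parseString(xmltext)
results = []
for elem in doc.getElementsByTagName('title'):
    edition = elem.getAttribute('edition')        # attribute value as a string
    text = ''.join(child.data for child in elem.childNodes
                   if child.nodeType == Node.TEXT_NODE)
    results.append((edition, text))

print(results)
```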

As a third option, Python’s standard library supports SAX parsing for XML. Under the SAX model, a class’s methods receive callbacks as a parse progresses and use state information to keep track of where they are in the document and collect its data:

# File saxparse.py
import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.inTitle = False
    def startElement(self, name, attributes):
        if name == 'title':
            self.inTitle = True
    def characters(self, data):
        if self.inTitle:
            print(data)
    def endElement(self, name):
        if name == 'title':
            self.inTitle = False

import xml.sax
parser = xml.sax.make_parser()
handler = BookHandler()
parser.setContentHandler(handler)
parser.parse('mybooks.xml')
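A SAX handler can also collect data instead of printing it as it arrives. The sketch below is one possible variation, assuming Python 3.X: the TitleCollector class and its attribute names are hypothetical, and the input is an inline byte string rather than mybooks.xml. It buffers text chunks per element, because a parser is allowed to deliver one tag's text through more than one characters callback:

```python
import xml.sax
import xml.sax.handler

class TitleCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler that gathers title text into a list."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.chunks = []        # text pieces for the title being parsed
        self.titles = []        # completed titles

    def startElement(self, name, attrs):
        if name == 'title':
            self.in_title = True
            self.chunks = []

    def characters(self, data):
        if self.in_title:
            self.chunks.append(data)

    def endElement(self, name):
        if name == 'title':
            self.titles.append(''.join(self.chunks))
            self.in_title = False

xmltext = (b'<books><title>Learning Python</title>'
           b'<title>Programming Python</title></books>')
handler = TitleCollector()
xml.sax.parseString(xmltext, handler)
print(handler.titles)
```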

Finally, the ElementTree system available in the etree package of the standard library can often achieve the same effects as XML DOM parsers, but with less code. It’s a Python-specific way to both parse and generate XML text; after a parse, its API gives access to components of the document:

# File etreeparse.py
from xml.etree.ElementTree import parse

tree = parse('mybooks.xml')
for E in tree.findall('title'):
    print(E.text)
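Since ElementTree can generate XML as well as parse it, a short round trip may help illustrate the other direction. This is only a sketch: it builds a reduced version of the document in memory, and it assumes Python 3.2 or later, where tostring accepts encoding='unicode' to return a str:

```python
import xml.etree.ElementTree as ET

# Build a small tree: a root element with two title children
root = ET.Element('books')
for name in ['Learning Python', 'Programming Python']:
    ET.SubElement(root, 'title').text = name

# Serialize to XML text (a str, because of encoding='unicode')
xmltext = ET.tostring(root, encoding='unicode')
print(xmltext)

# Parse the generated text back and extract the titles again
reparsed = ET.fromstring(xmltext)
titles = [e.text for e in reparsed.findall('title')]
print(titles)
```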

When run in either 2.6 or 3.0, all four of these scripts display the same printed result:

C:\misc> c:\python26\python domparse.py
Learning Python
Programming Python
Python Pocket Reference

C:\misc> c:\python30\python domparse.py
Learning Python
Programming Python
Python Pocket Reference

Technically, though, in 2.6 some of these scripts produce unicode string objects, while in 3.0 all produce str strings, since that type includes Unicode text (whether ASCII or other):

C:\misc> c:\python30\python
>>> from xml.dom.minidom import parse, Node
>>> xmltree = parse('mybooks.xml')
>>> for node in xmltree.getElementsByTagName('title'):
...     for node2 in node.childNodes:
...         if node2.nodeType == Node.TEXT_NODE:
...             node2.data
...
'Learning Python'
'Programming Python'
'Python Pocket Reference'

C:\misc> c:\python26\python
>>> ...same code...
...
u'Learning Python'
u'Programming Python'
u'Python Pocket Reference'
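A program that depends on the result type can check it directly. The following sketch assumes Python 3.X and parses an inline string instead of mybooks.xml. It confirms that the parsed text is a str and shows the explicit encode step needed when bytes are required:

```python
from xml.dom.minidom import parseString

doc = parseString('<books><title>Learning Python</title></books>')
node = doc.getElementsByTagName('title')[0].firstChild   # the text node
data = node.data

print(type(data).__name__)          # text type of the parse result
encoded = data.encode('utf-8')      # explicit conversion to bytes
print(type(encoded).__name__)
```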

Programs that must deal with XML parsing results in nontrivial ways will need to account for the different object type in 3.0. Again, though, because all strings have nearly identical interfaces
