Learning Python - Mark Lutz [503]
and we want to run a script to extract and display the content of all the nested title tags, as follows:
Learning Python
Programming Python
Python Pocket Reference
There are at least four basic ways to accomplish this (not counting more advanced tools like XPath). First, we could run basic pattern matching on the file’s text, though this tends to be inaccurate if the text is unpredictable. Where applicable, the re module we met earlier does the job—its match method looks for a match at the start of a string, search scans ahead for a match, and the findall method used here locates all places where the pattern matches in the string (the result comes back as a list of matched substrings corresponding to parenthesized pattern groups, or tuples of such for multiple groups):
# File patternparse.py
import re
text = open('mybooks.xml').read()
found = re.findall('
for title in found: print(title)
Second, to be more robust, we could perform complete XML parsing with the standard library’s DOM parsing support. DOM parses XML text into a tree of objects and provides an interface for navigating the tree to extract tag attributes and values; the interface is a formal specification, independent of Python:
# File domparse.py
from xml.dom.minidom import parse, Node
xmltree = parse('mybooks.xml')
for node1 in xmltree.getElementsByTagName('title'):
for node2 in node1.childNodes:
if node2.nodeType == Node.TEXT_NODE:
print(node2.data)
As a third option, Python’s standard library supports SAX parsing for XML. Under the SAX model, a class’s methods receive callbacks as a parse progresses and use state information to keep track of where they are in the document and collect its data:
# File saxparse.py
import xml.sax.handler
class BookHandler(xml.sax.handler.ContentHandler):
def __init__(self):
self.inTitle = False
def startElement(self, name, attributes):
if name == 'title':
self.inTitle = True
def characters(self, data):
if self.inTitle:
print(data)
def endElement(self, name):
if name == 'title':
self.inTitle = False
import xml.sax
parser = xml.sax.make_parser()
handler = BookHandler()
parser.setContentHandler(handler)
parser.parse('mybooks.xml')
Finally, the ElementTree system available in the etree package of the standard library can often achieve the same effects as XML DOM parsers, but with less code. It’s a Python-specific way to both parse and generate XML text; after a parse, its API gives access to components of the document:
# File etreeparse.py
from xml.etree.ElementTree import parse
tree = parse('mybooks.xml')
for E in tree.findall('title'):
print(E.text)
When run in either 2.6 or 3.0, all four of these scripts display the same printed result:
C:\misc> c:\python26\python domparse.py
Learning Python
Programming Python
Python Pocket Reference
C:\misc> c:\python30\python domparse.py
Learning Python
Programming Python
Python Pocket Reference
Technically, though, in 2.6 some of these scripts produce unicode string objects, while in 3.0 all produce str strings, since that type includes Unicode text (whether ASCII or other):
C:\misc> c:\python30\python
>>> from xml.dom.minidom import parse, Node
>>> xmltree = parse('mybooks.xml')
>>> for node in xmltree.getElementsByTagName('title'):
... for node2 in node.childNodes:
... if node2.nodeType == Node.TEXT_NODE:
... node2.data
...
'Learning Python'
'Programming Python'
'Python Pocket Reference'
C:\misc> c:\python26\python
>>> ...same code...
...
u'Learning Python'
u'Programming Python'
u'Python Pocket Reference'
Programs that must deal with XML parsing results in nontrivial ways will need to account for the different object type in 3.0. Again, though, because all strings have nearly identical interfaces