The task is to parse a simple XML document and analyze its contents by line number.
The right Python package seems to be xml.sax. But how do I use it?
After some digging in the documentation, I found:
- The xmlreader.Locator interface has the information: getLineNumber().
- The handler.ContentHandler interface has setDocumentLocator().
The first thought would be to create a Locator, pass it to the ContentHandler, and read the information off the Locator during calls to the handler's characters() method, etc.
BUT, xmlreader.Locator is only a skeleton interface, and its getLineNumber() and getColumnNumber() methods can only return -1.
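A quick way to see this for yourself (a minimal sketch in Python 3 syntax; the variable name is just for illustration):

```python
from xml.sax.xmlreader import Locator

# The base class is only an interface stub: it is not attached to any parser,
# so it has no position to report.
bare_locator = Locator()

print(bare_locator.getLineNumber())    # -1
print(bare_locator.getColumnNumber())  # -1
```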
So as a poor user, WHAT am I to do, short of writing a whole Parser and Locator of my own??
I'll answer my own question presently.
(Well I would have, except for the arbitrary, annoying rule that says I can't.)
I was unable to figure this out using the existing documentation (or by web searches), and was forced to read the source code for xml.sax (under /usr/lib/python2.7/xml/sax/ on my system).
The xml.sax function make_parser() by default creates a real Parser, but what kind of thing is that?
In the source code one finds that it is an ExpatParser, defined in expatreader.py.
And...it has its own Locator, an ExpatLocator. But, there is no access to this thing.
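To check what make_parser() hands back without reading the source, something like the following should do. It is only a sketch in Python 3 syntax, and it assumes a stock CPython install where the default parser list has not been overridden (for instance through the parser_list argument).

```python
import xml.sax
import xml.sax.expatreader

parser = xml.sax.make_parser()

# On a default install this is expected to be the expat-based reader
# defined in xml/sax/expatreader.py.
print(type(parser).__module__, type(parser).__name__)
print(isinstance(parser, xml.sax.expatreader.ExpatParser))
```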
Much head-scratching came between this and a solution.
- write your own ContentHandler, which knows about a Locator, and uses it to determine line numbers
- create an ExpatParser with xml.sax.make_parser()
- create an ExpatLocator, passing it the ExpatParser instance
- make the ContentHandler, giving it this ExpatLocator
- pass the ContentHandler to the parser's setContentHandler()
- call parse() on the Parser
For example:
from __future__ import print_function  # keeps print() working on Python 2.7 too

import sys
import xml.sax
import xml.sax.expatreader  # imported explicitly rather than relying on make_parser() to load it
import xml.sax.handler

class EltHandler(xml.sax.handler.ContentHandler):
    def __init__(self, locator):
        xml.sax.handler.ContentHandler.__init__(self)
        self.loc = locator
        self.setDocumentLocator(self.loc)

    def startElement(self, name, attrs):
        pass

    def endElement(self, name):
        pass

    def characters(self, data):
        # The locator reports the parser's current position in the input.
        lineNo = self.loc.getLineNumber()
        print("LINE", lineNo, data)

def spit_lines(filepath):
    try:
        parser = xml.sax.make_parser()
        # Tie a working locator to this particular parser instance.
        locator = xml.sax.expatreader.ExpatLocator(parser)
        handler = EltHandler(locator)
        parser.setContentHandler(handler)
        parser.parse(filepath)
    except IOError as e:
        print(e, file=sys.stderr)

if len(sys.argv) > 1:
    filepath = sys.argv[1]
    spit_lines(filepath)
else:
    print("Try providing a path to an XML file.", file=sys.stderr)
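For trying this out without a file on disk, here is a hedged variant of the same recipe in Python 3 syntax. The names LineCollector, collect_lines and SAMPLE are made up for this sketch. It feeds the parser an in-memory byte stream, and it collects (line, text) pairs instead of printing, skipping whitespace-only chunks.

```python
import io
import xml.sax
import xml.sax.expatreader
import xml.sax.handler


class LineCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler: records (line number, text) for non-blank text."""

    def __init__(self, locator):
        xml.sax.handler.ContentHandler.__init__(self)
        self.loc = locator
        self.seen = []

    def characters(self, data):
        text = data.strip()
        if text:  # ignore chunks that are only whitespace or newlines
            self.seen.append((self.loc.getLineNumber(), text))


def collect_lines(xml_bytes):
    # Same steps as above: parser, then locator, then handler, then parse().
    parser = xml.sax.make_parser()
    locator = xml.sax.expatreader.ExpatLocator(parser)
    handler = LineCollector(locator)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))  # parse() also accepts file-like objects
    return handler.seen


SAMPLE = b"<root>\n  <a>one</a>\n  <b>two</b>\n</root>"
print(collect_lines(SAMPLE))
```

Returning a list rather than printing makes the line numbers easy to compare against the input by eye.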
Martijn Pieters points out below another approach with some advantages.
If the superclass initializer of the ContentHandler is properly called, then it turns out there is a private-looking, undocumented member ._locator, which the parser fills in with a proper Locator (through setDocumentLocator()) when parse() is called.
Advantage: you don't have to create your own Locator (or find out how to create it).
Disadvantage: it's nowhere documented, and using an undocumented private variable is sloppy.
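A minimal sketch of that second approach, in Python 3 syntax. The class and function names are invented for illustration, and the whole thing leans on the undocumented _locator attribute, so treat it as an assumption about current behavior rather than a guaranteed API.

```python
import xml.sax
import xml.sax.handler


class LocatorAwareHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler that reads the locator the parser supplies."""

    def __init__(self):
        # Calling the superclass initializer creates self._locator (initially None).
        xml.sax.handler.ContentHandler.__init__(self)
        self.starts = []

    def startElement(self, name, attrs):
        # _locator is undocumented; it is assumed to have been set by the
        # parser through setDocumentLocator() before any events arrive.
        self.starts.append((name, self._locator.getLineNumber()))


def element_lines(xml_bytes):
    handler = LocatorAwareHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.starts


print(element_lines(b"<root>\n  <a>one</a>\n  <b>two</b>\n</root>"))
```

No ExpatLocator is created by hand here; the handler only uses whatever the parser passed in.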
Thanks Martijn!