I am working with potentially huge XML files containing complex trace information from on of my projects.
I would like to build indexes for those XML files so that one can quickly find sub sections of the XML document without having to load it all into memory.
If I have created a "shelve" index that could contains information like "books for author Joe" are at offsets [22322, 35446, 54545] then I can just open the xml file like a regular text file and seek to those offsets and then had that to one of the DOM parser that takes a file or strings.
The part that I have not figured out yet is how to quickly parse the XML and create such an index.
So what I need as a fast SAX parser that allows me to find the start offset of tags in the file together with the start events. So I can parse a subsection of the XML together with the starting point into the document, extract the key information and store the key and offset in the shelve index.
Since locators return line and column numbers in lieu of offset, you need a little wrapping to track line ends -- a simplified example (could have some offbyones;-)...:
Of course you don't need to keep all of the linelocs around -- to save memory, you can drop "old" ones (below the latest one queried) but then you need to make linelocs a dict, etc.