I currently have to fix an existing application so that it uses something other than the DOM interface of libxml2, because it turns out it gets passed XML files so large that they can't be loaded into memory.
I have rewritten most of the data loading from iterating over the DOM tree to using xmlTextReader, without too many problems. (I use xmlNewTextReaderFilename to open a local file.)
It turns out, however, that the subtree where the large data resides cannot be read in document order: I have to collect a small amount of data from a later part of it before I can process the earlier part. (And the problem is exactly that this subtree contains the large volume of data, so loading only this subtree into memory doesn't make much sense either.)
The easiest thing would be to just "clone" / "copy" my current reader, read ahead and then return to the original instance to continue reading there. (Seems I'm not the first one ... There's even something implemented on the C# side: XML Reader with Bookmarks.)
There doesn't appear to be any way, however, to "copy" the state of an xmlTextReader.
If I can't re-read part of a file, I could also re-read the whole file, which, although wasteful, would be OK here. But I would still need to remember where I was beforehand.
Is there maybe a simple way to remember where an xmlTextReader is in the current document, so that I can later find that position again when reading the document/file a second time?
Here's a problem example:
<root>
<cat1>
<data attrib="x1">
... here goes up to one GB in stuff ...
</data>
<data attrib="y2"> <!-- <<< Want to remember this position without having to re-read the stuff before -->
... even more stuff ...
</data>
<data attrib="z3">
<!-- I need (part of) the data here to meaningfully interpret the data in [y2] that
came before. The best approach would seem to first skip all that data
and then start back there at <data attrib="y2"> ... not having to re-read
the whole [x1] data would be a big plus! -->
</data>
</cat1>
...
</root>