Is there a possibility for cloning an xmlTextReade

Posted 2019-08-07 04:39

Question:

I currently have to fix an existing application to use something other than the DOM interface of libxml2 because it turns out it gets passed XML files so large that they can't be loaded into memory.

I have now rewritten most of the data loading from iterating over the DOM tree to using xmlTextReader, without too many problems. (I use xmlNewTextReaderFilename to open a local file.)

It turns out, however, that the subtree where the large data resides cannot be read in order: I have to collect a small amount of data that appears later before I can process what comes before it. (And it is exactly this subtree that contains the large volume of data, so loading only this subtree into memory doesn't make much sense either.)

The easiest thing would be to just "clone" / "copy" my current reader, read ahead and then return to the original instance to continue reading there. (Seems I'm not the first one ... There's even something implemented on the C# side: XML Reader with Bookmarks.)

However, there doesn't appear to be any way to "copy" the state of an xmlTextReader.

If I can't re-read part of a file, I could also re-read the whole file, which, although wasteful, would be OK here. But I would still need to remember where I was beforehand.

Is there maybe a simple way for an xmlTextReader to remember where it is in the current document, so that I can later find that position again when reading the document/file a second time?

Here's a problem example:

<root>
  <cat1>
    <data attrib="x1">
      ... here goes up to one GB in stuff ...
    </data>
    <data attrib="y2"> <!-- <<< Want to remember this position without having to re-read the stuff before -->
      ... even more stuff ...
    </data>
    <data attrib="z3">
       <!-- I need (part of) the data here to meaningfully interpret the data in [y2] that 
            came before. The best approach would seem to first skip all that data
            and then start back there at <data attrib="y2"> ... not having to re-read
            the whole [x1] data would be a big plus! -->
    </data>
  </cat1>
  ...
</root>

Answer 1:

I would like to give a workaround answer from what I learned at the XML mailing list:

There is no easy way to "clone" the state of an xmlReader. What should be possible, and also pretty easy, is counting the reads made on a document.

That is, to read a document with xmlReader, you will typically call the following in a loop:

// looping ...
status = ::xmlTextReaderRead(pReader);

Provided you do that in a structured way (for example, I ended up writing a little wrapper class that encapsulates my usage pattern for xmlReader), it is then relatively easy to add a counter:

// looping ...
status = ::xmlTextReaderRead(pReader);
if (1 == status) { // success
  ++m_ReadCounter;
}

To re-read a document up to a certain position, you open it again and call xmlTextReaderRead m_ReadCounter times, discarding the results, until you reach the position where you want to start again.
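
The counting and replaying described above can be sketched roughly as follows. This is only a minimal illustration, not the wrapper class I actually wrote: CountingReader, Read, SkipTo and Position are hypothetical names, and the read step is passed in as a callable so the sketch does not depend on libxml2 itself. The callable is assumed to follow the xmlTextReaderRead convention (1 = node read, 0 = end of document, -1 = error).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical wrapper (not part of libxml2) that counts successful reads.
// `step` is assumed to behave like xmlTextReaderRead:
// 1 = a node was read, 0 = end of document, -1 = error.
class CountingReader {
public:
    explicit CountingReader(std::function<int()> step)
        : m_Step(std::move(step)) {}

    // Advance by one node; count the step only if it succeeded.
    int Read() {
        const int status = m_Step();
        if (status == 1) {
            ++m_ReadCounter;
        }
        return status;
    }

    // Discard nodes until `target` successful reads have been made in total.
    // Returns false if the document ends or fails before that point.
    bool SkipTo(std::size_t target) {
        assert(target >= m_ReadCounter && "a forward-only reader cannot rewind");
        while (m_ReadCounter < target) {
            if (Read() != 1) {
                return false;
            }
        }
        return true;
    }

    // Number of successful reads so far; usable as a bookmark.
    std::size_t Position() const { return m_ReadCounter; }

private:
    std::function<int()> m_Step;
    std::size_t m_ReadCounter = 0;
};
```

With libxml2, the step would presumably be something like a lambda that calls xmlTextReaderRead on a reader opened with xmlNewTextReaderFilename. On the first pass you store Position() when you are on the node of interest; on the second pass you open a fresh reader on the same file and call SkipTo with the stored value. This only lines up if both passes advance the reader in the same way (the same parser options, an unchanged file, and no skipping calls such as xmlTextReaderNext that bypass the counter).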

Yes, you have to re-parse the document from the beginning up to that position, but that may be fast enough. (And it may actually be better/faster than caching a very large part of the document.)