XML streaming with XProc

Posted 2019-04-14 19:35

Question:

I'm playing with XProc, the XML pipeline language, and XML Calabash (http://xmlcalabash.com/). I'd like to find an example of streaming large XML documents. For example, given the following huge XML document:

<Books>
 <Book>
   <title>Book-1</title>
 </Book>
 <Book>
   <title>Book-2</title>
 </Book>
 <Book>
   <title>Book-3</title>
 </Book>

<!-- many many.... -->
 <Book>
   <title>Book-N</title>
 </Book>
</Books>

How should I loop (in a streaming fashion) over documents x through N, each of the form

<Books>
 <Book>
   <title>Book-x</title>
 </Book>
</Books>

and process each document with an XSLT stylesheet? Is this possible with XProc?

Answer 1:

You should have a look at QuiXProc (http://code.google.com/p/quixproc), an XProc implementation based on Calabash that adds streaming and parallel processing. Hope this helps.



Answer 2:

Here is how you could write it in XProc; run with QuiXProc, this pipeline would stream:

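<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <!-- Load the large input document -->
  <p:load href="hugedocument.xml"/>
  <!-- Iterate over each Book element as a separate document -->
  <p:for-each>
    <p:iteration-source select="/Books/Book"/>
    <!-- Transform the current Book with book.xsl -->
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="book.xsl"/>
      </p:input>
      <p:input port="parameters">
        <p:empty/>
      </p:input>
    </p:xslt>
  </p:for-each>
  <!-- Collect all transformed results under a single Books element -->
  <p:wrap-sequence wrapper="Books"/>
  <p:store href="hugedocument.res.xml"/>
</p:declare-step>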


Answer 3:

I remember a recent discussion on the XProc Dev list related to streaming. It seems that Calabash does not attempt streaming; see Norman Walsh's message on that list.

Saxon SA supports streaming for XSLT and XQuery. For details, see: http://www.saxonica.com/documentation/sourcedocs/serial.html



Answer 4:

Yes, much as I'd like to support streaming, my real goals for XML Calabash were completeness and correctness.

I have some ideas for reworking the internals of XML Calabash to use more of the push/pull streaming features of Saxon, but there are a lot of other things on my "todo" list too :-/



Answer 5:

EMC's Calumet (http://developer.emc.com/xmltech) doesn't do streaming either. The main focus until now has been compliance with the XProc specification, together with integration with our other XML-related tools, such as the xDB native XML database. Support for streaming is on my radar, although I can't say right now when I will be able to get to it.



Answer 6:

Even though most XProc processors don't stream data between steps, this doesn't necessarily mean that your case won't work (for instance, that memory usage will explode). It depends on what you want to do with the result of the XSLT step.

If you are gathering the results to build one big output file, then yes, this may be a problem. But in that case you might be better off with a streaming solution (SAX, StAX, Joost, ...) anyhow.

If you will be storing the result of each XSLT run separately, the problem will be much smaller. You would only need to be concerned about whether you have sufficient memory to load the initial document and to process each per-book document. I'm not sure how well Saxon underneath XML Calabash would behave, but I expect that a size of up to 50 megabytes shouldn't be a very big issue.
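To give an idea of what such a streaming approach looks like outside XProc, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree.iterparse`, which is an incremental pull parser in the same spirit as StAX. It is not what any of the XProc processors mentioned in this thread do internally. The helper names `iter_books` and `wrap_book` are made up for illustration, and the XSLT step itself is left out: each wrapped document would be handed to whatever transformer you use.

```python
import io
import xml.etree.ElementTree as ET


def iter_books(source):
    """Yield each <Book> element from source without keeping the whole tree."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # first event is the start of the <Books> root
    for event, elem in context:
        if event == "end" and elem.tag == "Book":
            yield elem
            # Detach already-processed children so memory use stays flat.
            root.clear()


def wrap_book(book):
    """Wrap one <Book> in its own <Books> document, as in the question."""
    wrapper = ET.Element("Books")
    wrapper.append(book)
    return ET.tostring(wrapper, encoding="unicode")


# Small in-memory stand-in for hugedocument.xml (illustrative data only).
sample = io.BytesIO(
    b"<Books>"
    b"<Book><title>Book-1</title></Book>"
    b"<Book><title>Book-2</title></Book>"
    b"</Books>"
)
for book in iter_books(sample):
    print(wrap_book(book))  # here you would run the XSLT on each document
```

For a real file you would pass the path (for example `"hugedocument.xml"`) to `iter_books` instead of the in-memory sample. Only one `Book` subtree is held at a time, so memory use depends on the size of a single book, not on N.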

Cheers