I'm playing with XProc, the XML pipeline language, and XML Calabash (http://xmlcalabash.com/). I'd like to find an example of streaming large XML documents.
For example, given the following huge XML document:
<Books>
<Book>
<title>Book-1</title>
</Book>
<Book>
<title>Book-2</title>
</Book>
<Book>
<title>Book-3</title>
</Book>
<!-- many many.... -->
<Book>
<title>Book-N</title>
</Book>
</Books>
How should I proceed to loop, in a streaming fashion, over the N documents (x = 1..N) of the form
<Books>
<Book>
<title>Book-x</title>
</Book>
</Books>
and process each document with an XSLT stylesheet? Is this possible with XProc?
You should have a look at QuiXProc (http://code.google.com/p/quixproc), an implementation of XProc based on Calabash that adds streaming and parallel processing.
Hope this helps.
Here is how you could do it in XProc, in a way that would stream with QuiXProc:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:load href="hugedocument.xml"/>
  <p:for-each>
    <p:iteration-source select="/Books/Book"/>
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="book.xsl"/>
      </p:input>
      <p:input port="parameters">
        <p:empty/>
      </p:input>
    </p:xslt>
  </p:for-each>
  <p:wrap-sequence wrapper="Books"/>
  <p:store href="hugedocument.res.xml"/>
</p:declare-step>
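The pipeline refers to book.xsl without showing it. Purely as an illustration, a minimal stylesheet could look like the sketch below. Its content is an assumption: the original does not say what the transformation should do, so the output shape (a Book element carrying a title attribute) is hypothetical. The one point it is meant to show is that p:iteration-source hands each selected Book to the XSLT step as a document of its own, so Book is the document element there.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Inside p:for-each, every input document has a single Book
       as its document element (there is no Books wrapper). -->
  <xsl:template match="/Book">
    <!-- Hypothetical output: move the title text into an attribute. -->
    <Book title="{normalize-space(title)}"/>
  </xsl:template>

</xsl:stylesheet>
```

With this stylesheet, the p:wrap-sequence step would collect the transformed Book elements back into a single Books document before it is stored.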
I remember a recent discussion on the XProc Dev list related to streaming. It seems that Calabash does not attempt streaming; see Norman Walsh's message on that list.
Saxon-SA supports streaming for XSLT and XQuery. For details see:
http://www.saxonica.com/documentation/sourcedocs/serial.html
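For comparison, here is a sketch of the same per-Book processing done as streamed XSLT, outside XProc. It does not use the Saxon-specific extensions described on the page above. Instead it uses the standard XSLT 3.0 instruction xsl:source-document, which postdates that page, and it assumes a processor that actually implements streaming (for example a streaming-capable Saxon edition). The file name is taken from the question; the mode name and the output shape are hypothetical.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">

  <!-- Entry point: run the stylesheet with the initial template,
       without supplying a source document. -->
  <xsl:template name="xsl:initial-template">
    <Books>
      <!-- Read the large input as a stream rather than building a full tree. -->
      <xsl:source-document href="hugedocument.xml" streamable="yes">
        <xsl:for-each select="Books/Book">
          <!-- copy-of(.) materialises one Book at a time as a small
               in-memory tree, which ordinary templates can then process. -->
          <xsl:apply-templates select="copy-of(.)" mode="book"/>
        </xsl:for-each>
      </xsl:source-document>
    </Books>
  </xsl:template>

  <!-- Hypothetical per-Book transformation (mode name is an assumption). -->
  <xsl:template match="Book" mode="book">
    <Book title="{normalize-space(title)}"/>
  </xsl:template>

</xsl:stylesheet>
```

Only one Book subtree needs to be held in memory at a time, which is the property the question is after. Whether this can be combined with a given XProc processor is a separate matter.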
Yes, much as I'd like to support streaming, my real goals for XML Calabash were completeness and correctness.
I have some ideas for reworking the internals of XML Calabash to use more of the push/pull streaming features of Saxon, but there are a lot of other things on my "todo" list too :-/
EMC's Calumet (http://developer.emc.com/xmltech) doesn't do streaming either. The main focus until now has been compliance with the XProc specification, together with integration with our other XML-related tools, such as the xDB native XML database. Support for streaming is on my radar, although right now I can't tell when I will be able to get to it.
Even though most XProc processors don't stream data between steps, this doesn't necessarily mean that your case won't work (for instance, that memory usage will explode). It depends on what you want to do with the result of the XSLT step.
If you are gathering the results to build one big output file, then yes, this may be a problem. But in that case you might be better off with a streaming solution (SAX, StAX, Joost, ...) anyhow.
If you will be storing the result of each XSLT transformation separately, the problem will be much smaller. You would only need to be concerned about whether you have sufficient memory available to load the initial document and to process each extracted document. I'm not sure how well Saxon underneath XML Calabash would behave, but I expect that a size of up to 50 megabytes wouldn't be a big issue.
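Storing each result separately could be sketched as the following variant of the pipeline shown earlier in this thread. It moves p:store inside p:for-each and drops p:wrap-sequence, so no combined output document is ever built. The output file names (book-1.xml, book-2.xml, ...) are an assumption for illustration; p:iteration-position() is the standard XProc 1.0 function that returns the position of the current iteration.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:load href="hugedocument.xml"/>
  <p:for-each>
    <p:iteration-source select="/Books/Book"/>
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="book.xsl"/>
      </p:input>
      <p:input port="parameters">
        <p:empty/>
      </p:input>
    </p:xslt>
    <!-- Write each transformed Book to its own file;
         the naming scheme is hypothetical. -->
    <p:store>
      <p:with-option name="href"
                     select="concat('book-', p:iteration-position(), '.xml')"/>
    </p:store>
  </p:for-each>
</p:declare-step>
```

In a non-streaming processor the initial document is still loaded in full by p:load, so this only avoids holding all the results in memory at once.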
Cheers