I'm writing an application which processes a lot of XML files (>1000) with deep node structures. Parsing a file with 22,000 nodes takes about six seconds with Woodstox (Event API).
The algorithm runs in a process with user interaction, where only a few seconds of response time are acceptable, so I need a better strategy for handling the XML files.
- My process analyses the xml files (extracts only a few nodes).
- Extracted nodes are processed and the result is written into a new data stream (resulting in a copy of the document with modified nodes).
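The extract-and-rewrite step described above can be sketched with the StAX Event API. Everything here is illustrative: `price` as the element of interest and `process(...)` as the transformation are placeholder assumptions, not part of the original question.

```java
import javax.xml.stream.*;
import javax.xml.stream.events.*;
import java.io.*;

public class CopyWithModification {
    // Hypothetical node of interest; a real application would match its own elements.
    private static final String TARGET = "price";

    // Stream the document through, copying every event and replacing only
    // the text content of the target nodes.
    public static void copy(InputStream in, OutputStream out) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();
        boolean inTarget = false;
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()
                    && TARGET.equals(e.asStartElement().getName().getLocalPart())) {
                inTarget = true;
                writer.add(e);
            } else if (inTarget && e.isCharacters()) {
                // Emit the processed text instead of the original characters.
                writer.add(events.createCharacters(process(e.asCharacters().getData())));
            } else {
                if (e.isEndElement()) inTarget = false;
                writer.add(e);
            }
        }
        writer.close();
        reader.close();
    }

    // Placeholder transformation (assumption).
    static String process(String s) { return s.trim().toUpperCase(); }
}
```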
Now I'm thinking about a multithreaded solution (which scales better on 16+ core hardware). I thought about the following strategies:
- Creating multiple parsers and running them in parallel on the xml sources.
- Rewriting my parsing algorithm to be thread-safe, so that only one instance of the parser is used (factories, ...)
- Split the XML source into chunks and assign the chunks to multiple processing threads (map-reduce xml - serial)
- Optimizing my algorithm (is there a better StAX parser than Woodstox?) / using a parser with built-in concurrency
I want to improve both the overall performance and the per-file performance.
Do you have experience with such problems? What is the best way to go?
I agree with Jim. I think that if you want to improve the overall performance of processing 1000 files, your plan is good, except for #3, which is irrelevant in this case.
If, however, you want to improve the performance of parsing a single file, you have a problem. I do not know how it is possible to split an XML file without parsing it. Each chunk will be invalid XML and your parser will fail.
I believe that improving the overall time will be good enough for you. In this case, read this tutorial:
http://download.oracle.com/javase/tutorial/essential/concurrency/index.html
then create a thread pool of, for example, 100 threads and a queue that contains the XML sources. Each thread will then parse only about 10 files, which will bring a serious performance benefit in a multi-CPU environment.
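A minimal sketch of that setup, assuming your existing per-file parsing code can be called from `parseFile(...)` (the counter is only there to make the sketch observable):

```java
import java.io.File;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelParsing {
    // Counter only so the sketch is observable; replace with real result handling.
    public static final AtomicInteger parsedCount = new AtomicInteger();

    // Drain a list of XML files through a fixed-size thread pool.
    public static void parseAll(List<File> xmlFiles, int poolSize) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (final File f : xmlFiles) {
            pool.submit(new Runnable() {
                public void run() {
                    parseFile(f); // your existing per-file StAX/Woodstox parsing goes here
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Placeholder for the actual Woodstox parsing (assumption).
    static void parseFile(File f) {
        parsedCount.incrementAndGet();
    }
}
```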
In addition to the existing good suggestions, there is one rather simple thing to do: use the cursor API (XMLStreamReader), NOT the Event API. The Event API adds 30-50% overhead without (in my opinion) making processing significantly easier. In fact, if you want convenience, I would recommend using StaxMate instead; it builds on top of the cursor API without adding significant overhead (at most 5-10% compared to hand-written code).
Now: I assume you have done basic optimizations with Woodstox; but if not, check out "3 Simple Rules for Fast XML-processing using Stax". Specifically, you absolutely should:
- Make sure you only create XMLInputFactory and XMLOutputFactory instances once
- Close readers and writers to ensure buffer recycling (and other useful reuse) works as expected.
The reason I mention this is that while these changes make no functional difference (the code works as expected), they can make a big performance difference, especially when processing smaller files.
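The two rules above can be sketched as follows. The factory is created once and shared (Woodstox factories are generally safe to share for reading once configured, though that is worth verifying for your StAX implementation), and the reader is closed in a `finally` block so internal buffers can be recycled:

```java
import javax.xml.stream.*;
import java.io.Reader;

public class FactoryReuse {
    // Created once; factory construction is comparatively expensive.
    private static final XMLInputFactory INPUT_FACTORY = XMLInputFactory.newInstance();

    // Example task: count the elements in a document.
    public static int countElements(Reader src) throws XMLStreamException {
        XMLStreamReader r = INPUT_FACTORY.createXMLStreamReader(src);
        try {
            int n = 0;
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT) n++;
            }
            return n;
        } finally {
            r.close(); // lets the implementation recycle its buffers
        }
    }
}
```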
Running multiple instances also makes sense, although usually with at most one thread per core. However, you will only see a benefit as long as your storage I/O can support such speeds; if the disk is the bottleneck, this will not help and can in some cases hurt (if disk seeks compete). But it is worth a try.
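Sizing the pool to one thread per core, as suggested, might look like this (a sketch assuming the parsing is CPU-bound; with slow disks a smaller pool may behave better):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    // One worker per available core, per the advice above.
    public static ExecutorService newParsingPool() {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(cores);
    }
}
```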