I have the following problem:
I've got an XML file (approx. 1 GB) and have to move back and forth through it (i.e. not sequentially, one element after the other) in order to get the required data and do some operations on it. Initially I used the Java DOM package, but, unsurprisingly, while parsing the XML file the JVM reached its maximum heap space and halted.
To overcome this problem, one solution I came up with was to find another parser that iterates over each element in the XML, and then store its contents in a temporary SQLite database on my hard disk. This way the JVM's heap is not exceeded, and once all the data is loaded, I ignore the XML file and continue my operations on the temporary SQLite database.
Is there another way to tackle this problem?
If you don't want to be bound by memory limits, I certainly recommend that you stick with your current approach and store everything in a database.
The parsing of the XML file should be done by a SAX parser, as everybody has recommended (including me). This way you can create one object at a time and immediately persist it into the database.
For the post-processing (resolving cross-references), you can use SELECTs from the database, make primary keys, indexes, etc. You can use an ORM (EclipseLink, Hibernate) as well if you feel comfortable with that.
Actually, I don't really recommend SQLite; it's easier to set up a MySQL server and store the data there. Later you can even reuse the XML data (if you don't delete it).
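The "one object at a time" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the element name `item`, its `id` attribute, and the classes `RecordHandler` and `SaxToSink` are hypothetical, and the `Consumer` sink stands in for whatever actually persists a record (for example a JDBC `PreparedStatement` insert).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Builds one record per <item> element (hypothetical element name) and hands it
// to the sink as soon as the element closes, so only one record is held in memory.
class RecordHandler extends DefaultHandler {
    private final Consumer<Map<String, String>> sink; // assumption: stands in for a database insert
    private final StringBuilder text = new StringBuilder();
    private Map<String, String> current;

    RecordHandler(Consumer<Map<String, String>> sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
        if ("item".equals(qName)) {
            current = new LinkedHashMap<>();
            current.put("id", atts.getValue("id"));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so append rather than assign.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return; // outside any <item>
        }
        if ("item".equals(qName)) {
            sink.accept(current); // persist the finished record, then forget it
            current = null;
        } else {
            current.put(qName, text.toString().trim()); // child element becomes a column
        }
    }
}

class SaxToSink {
    static void parse(InputStream in, Consumer<Map<String, String>> sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(in, new RecordHandler(sink));
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory sample; a real run would pass a FileInputStream instead.
        String xml = "<items><item id=\"1\"><name>a</name><ref>2</ref></item>"
                   + "<item id=\"2\"><name>b</name><ref>1</ref></item></items>";
        parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
              record -> System.out.println(record));
    }
}
```

Cross-references such as the `ref` value above are not resolved during parsing; they are simply stored, and can be joined later with queries once everything is in the database.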
SAX (Simple API for XML) will help you here.
Here is an example implementation:
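A minimal sketch of what such an implementation can look like, using the SAX API that ships with the JDK (`javax.xml.parsers`). Only the name `MyHandler` comes from this answer; the `SaxExample` class, the counting logic and the in-memory sample document are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Receives parsing events one by one; the document is never loaded as a whole.
class MyHandler extends DefaultHandler {
    int elements = 0;  // number of elements seen so far
    int depth = 0;     // current nesting depth
    int maxDepth = 0;  // deepest nesting encountered

    @Override
    public void startDocument() {
        System.out.println("start of document");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++;
        depth++;
        maxDepth = Math.max(maxDepth, depth);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    @Override
    public void endDocument() {
        System.out.println("end of document: " + elements + " elements, max depth " + maxDepth);
    }
}

class SaxExample {
    static MyHandler parse(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        MyHandler handler = new MyHandler();
        parser.parse(in, handler); // events are pushed to the handler while reading
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Sample input; for a large file, pass a FileInputStream instead.
        String xml = "<a><b/><b><c/></b></a>";
        parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```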
In MyHandler you define the actions to be taken when events such as the start or end of a document or element are generated.
If you require a resource-friendly approach to handling very large XML, try http://www.xml2java.net/xml-to-java-data-binding-for-big-data/ which allows you to process data in a SAX way, but with the advantage of getting high-level events (XML data mapped onto Java objects) and being able to work with these objects directly in your code. It thus combines the convenience of JAXB with the resource-friendliness of SAX.
If you want to use a higher-level approach than SAX, which can be very tricky to program, you could look at streaming XSLT transformations using a recent Saxon-EE release. However, you've been too vague about the precise processing you are doing for me to know whether this will work for your particular case.