Parsing large XML documents in Java

Posted 2020-02-08 17:50

I have the following problem:

I've got an XML file (approx. 1 GB) and have to move back and forth through it (i.e. not sequentially, one element after the other) in order to get the required data and perform some operations on it. Initially I used the Java DOM package, but obviously, while parsing the XML file, the JVM reaches its maximum heap space and halts.

In order to overcome this problem, one of the solutions I came up with was to find another parser that iterates over each element in the XML and to store its contents in a temporary SQLite database on my hard disk. This way the JVM's heap is not exceeded, and once all the data has been loaded I can ignore the XML file and continue my operations on the temporary SQLite database.
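
For illustration, roughly what I have in mind (the "record" element, the table and column names, and the file names are made up; StAX is used here simply as an example of a parser that iterates one element at a time, and the sqlite-jdbc driver is assumed to be on the classpath):

import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class XmlToSqlite {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("huge.xml");
             Connection db = DriverManager.getConnection("jdbc:sqlite:temp.db")) {

            db.createStatement().execute(
                    "CREATE TABLE IF NOT EXISTS record (id TEXT, content TEXT)");
            db.setAutoCommit(false); // commit in batches; much faster for bulk inserts

            PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO record (id, content) VALUES (?, ?)");

            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            String id = null;
            StringBuilder text = new StringBuilder();
            int pending = 0;

            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    id = reader.getAttributeValue(null, "id");
                    text.setLength(0);
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    // one row per <record>; only one record is in memory at a time
                    insert.setString(1, id);
                    insert.setString(2, text.toString().trim());
                    insert.addBatch();
                    if (++pending == 10_000) { // flush so the batch itself stays small
                        insert.executeBatch();
                        pending = 0;
                    }
                }
            }
            insert.executeBatch();
            db.commit();
            reader.close();
        }
    }
}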

Is there another way to tackle the problem at hand?

4 Answers
萌系小妹纸
#2 · 2020-02-08 18:34

If you don't want to be bound by the memory limits, I certainly recommend using your current approach and storing everything in a database.

The parsing of the XML file should be done with a SAX parser, as everybody (including me) has recommended. This way you can build one object at a time and immediately persist it into the database.

For the post-processing (resolving cross-references), you can use SELECTs against the database, add primary keys, indexes, etc. You can also use an ORM (EclipseLink, Hibernate) if you feel comfortable with that.
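
Roughly what that post-processing could look like with plain JDBC (all table and column names are placeholders; here each row is assumed to carry a ref_id column pointing at another record's id, and the SQLite URL matches the temporary database from the question):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PostProcess {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:sqlite:temp.db");
             Statement st = db.createStatement()) {

            // Index the columns you join or filter on, once the load is finished
            st.execute("CREATE INDEX idx_record_ref ON record (ref_id)");

            // Resolve a cross-reference with an ordinary self-join instead of
            // walking up and down the XML tree
            try (ResultSet rs = st.executeQuery(
                    "SELECT r.id, p.content FROM record r "
                    + "JOIN record p ON p.id = r.ref_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getString(2));
                }
            }
        }
    }
}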

Actually, I don't really recommend SQLite; it's easier to set up a MySQL server and store the data there. Later you can even reuse the XML data (if you don't delete it).

劳资没心,怎么记你
#3 · 2020-02-08 18:40

SAX (Simple API for XML) will help you here.

Unlike the DOM parser, the SAX parser does not create an in-memory representation of the XML document, so it is faster and uses less memory. Instead, the SAX parser reports the XML document structure to the client by invoking callbacks, that is, by calling methods on an org.xml.sax.helpers.DefaultHandler instance provided to the parser.

Here is an example implementation:

SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
DefaultHandler handler = new MyHandler();
parser.parse("file.xml", handler);

Here MyHandler is where you define the actions to be taken when events such as the start or end of the document or of an element are generated:

class MyHandler extends DefaultHandler {

    @Override
    public void startDocument() throws SAXException {
        // Called once at the beginning of the document, e.g. to set up buffers
        // or open the database connection.
    }

    @Override
    public void endDocument() throws SAXException {
        // Called once at the end of the document, e.g. to flush and close resources.
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {
        // Called for every opening tag; qName is the element name and
        // attributes holds its attributes.
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        // Called for every closing tag, e.g. to persist the object built up
        // since the matching startElement.
    }

    // Called for each chunk of character data (for example to append the data
    // to a buffer, add it to a node, or write it to a file). Note that the
    // text of one element may arrive in several chunks.
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
    }

}
Summer. ? 凉城
#4 · 2020-02-08 18:41

If you require a resource-friendly approach for handling very large XML, try this: http://www.xml2java.net/xml-to-java-data-binding-for-big-data/. It allows you to process the data in a SAX-like way, but with the advantage of receiving high-level events (the XML data mapped onto Java objects) and of being able to work with these objects directly in your code. So it combines JAXB convenience with SAX resource-friendliness.
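
The same pattern can also be sketched with just the JDK's StAX and JAXB APIs (this is not the API of the library above, only an illustration of the idea; the Record class, its fields, the "record" element name and the file name are made up, and on Java 11+ JAXB has to be added as a separate dependency):

import java.io.FileInputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingBinding {

    // Hypothetical record type; adapt the fields to your actual elements
    @XmlRootElement(name = "record")
    public static class Record {
        public String id;
        public String value;
    }

    public static void main(String[] args) throws Exception {
        Unmarshaller um = JAXBContext.newInstance(Record.class).createUnmarshaller();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("huge.xml"));

        while (reader.hasNext()) {
            if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && "record".equals(reader.getLocalName())) {
                // Unmarshal just this element; the cursor is left after it,
                // so only one Record is in memory at a time
                Record r = um.unmarshal(reader, Record.class).getValue();
                process(r);
            } else {
                reader.next();
            }
        }
        reader.close();
    }

    private static void process(Record r) {
        System.out.println(r.id + " = " + r.value);
    }
}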

手持菜刀,她持情操
#5 · 2020-02-08 18:42

If you want to use a higher-level approach than SAX, which can be very tricky to program, you could look at streaming XSLT transformations using a recent Saxon-EE release. However, you have been too vague about the precise processing you are doing for me to know whether this will work in your particular case.
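
Roughly, the Java side of such a streaming transformation could look like this with Saxon's s9api (the file names are placeholders; it assumes a Saxon-EE licence and a stylesheet that declares <xsl:mode streamable="yes"/> and obeys the streamability rules, so the 1 GB input is never held in memory as a whole):

import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.Serializer;
import net.sf.saxon.s9api.Xslt30Transformer;
import net.sf.saxon.s9api.XsltCompiler;

public class StreamingTransform {
    public static void main(String[] args) throws Exception {
        // true = licensed (EE) edition, which is what enables streaming
        Processor processor = new Processor(true);
        XsltCompiler compiler = processor.newXsltCompiler();

        // extract.xsl is expected to declare a streamable mode
        Xslt30Transformer transformer = compiler
                .compile(new StreamSource(new File("extract.xsl")))
                .load30();

        Serializer out = processor.newSerializer(new File("result.xml"));
        transformer.applyTemplates(new StreamSource(new File("huge.xml")), out);
    }
}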
