How to read large XML file consisting of large num

Posted 2019-07-04 10:59

Question:

I have a large XML file that consists of items of relatively fixed size, i.e.:

<rootElem>
  <item>...</item>
  <item>...</item>
  <item>...</item>
</rootElem>

The item elements are relatively shallow and typically rather small (<100 KB), but there may be a lot of them (hundreds of thousands). The items are completely independent of each other.

How could I process the file efficiently in Java? I can't read the whole file in as DOM, and I'd rather not use SAX because the code gets rather complex. I'd also like to avoid splitting the file into smaller pieces.

Ideally, I could obtain each item element, one at a time, as a separate DOM document that I could process using tools like JAXB. Basically, I just want to loop once over all the items.

I would think that this is a rather common problem.

Answer 1:

Java 6 has StAX support. It performs stream processing like SAX, but uses a pull-based approach, which leads to simpler handling code.
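To illustrate the pull style, here is a minimal sketch using XMLStreamReader from javax.xml.stream. The class name StaxItemIds, the method forEachId, and the <id> child element are hypothetical and only serve the example. The code asks the reader for the next event when it is ready for one, so no handler state machine is needed.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxItemIds {

    /**
     * Streams over the document and hands the text of every (hypothetical)
     * text-only <id> element to the handler. Returns how many were seen.
     */
    static int forEachId(Reader in, Consumer<String> handler) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // We pull the next event ourselves instead of receiving callbacks.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "id".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag.
                    handler.accept(reader.getElementText());
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }
}
```

A call such as StaxItemIds.forEachId(new FileReader("input.xml"), System.out::println) would print one id per line while keeping only the current event in memory.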



Answer 2:

When the input is large, sequential (a.k.a. stream) processing of the document is generally what's called for. It's true that SAX can become a bit messy (or at least require a fair bit of code) because you basically have to build a state machine to do the extraction. If you look at XML pull parsers rather than event-based implementations, you may find this approach slightly simpler to work with.

Your idea to extract the contents of the item elements is possible as well, using SAX for the first step, and may strike an acceptable balance between using event/pull parsing and the flexibility of full DOM access. (It will still be way slower than event/pull parsing, doing heavy allocation, but at least the requirement to keep it all in memory at the same time is lifted.)
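One way to sketch this two-step idea is shown below. Note that it swaps in a StAX pull parser for the first step instead of SAX, and copies each item into its own DOM document with an identity Transformer fed from a StAXSource. The class name ItemDomStreamer and the method forEachItem are hypothetical. After each transform the loop re-checks the current event rather than assuming where the transformer left the reader.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;

final class ItemDomStreamer {

    /** Passes each <item> to the handler as a separate DOM document; returns the item count. */
    static int forEachItem(Reader in, Consumer<Document> handler) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // A Transformer created without a stylesheet performs an identity copy.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    // Copy just this element (and its subtree) into a fresh document.
                    Document doc = builder.newDocument();
                    identity.transform(new StAXSource(reader), new DOMResult(doc));
                    handler.accept(doc);
                    count++;
                    // Do not advance here: the transform has already consumed the item.
                } else {
                    reader.next();
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }
}
```

Only one item's DOM tree is alive at a time, so memory use stays bounded by the largest item rather than by the file size.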



Answer 3:

I have not tried this, but... If your XML files always have the same format, you could parse them yourself with a BufferedReader, looking for <item> tags, and store the item content in a StringBuffer. You could then parse each string (with item as the root) using a DOM parser, and process it. You need only one DocumentBuilder for all the items.
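A rough sketch of this text-scanning idea follows. It assumes that <item> tags carry no attributes, are never nested or self-closing, and never appear inside comments or CDATA. The class name TextItemSplitter and the method forEachItem are hypothetical, and a StringBuilder is used in place of a StringBuffer since no synchronization is needed.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class TextItemSplitter {
    private static final String OPEN = "<item>";
    private static final String CLOSE = "</item>";

    /** Cuts the raw text into <item>...</item> chunks and parses each chunk with one shared DocumentBuilder. */
    static int forEachItem(Reader in, Consumer<Document> handler) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        StringBuilder buffer = new StringBuilder();
        char[] chunk = new char[8192];
        int count = 0;
        try (BufferedReader reader = new BufferedReader(in)) {
            int read;
            while ((read = reader.read(chunk)) != -1) {
                buffer.append(chunk, 0, read);
                int end;
                // Extract every complete item currently held in the buffer.
                while ((end = buffer.indexOf(CLOSE)) != -1) {
                    int start = buffer.indexOf(OPEN);
                    if (start == -1 || start > end) {
                        throw new IllegalStateException("Unexpected item layout");
                    }
                    String itemXml = buffer.substring(start, end + CLOSE.length());
                    // Drop the consumed text so the buffer never holds more than a few items.
                    buffer.delete(0, end + CLOSE.length());
                    handler.accept(builder.parse(new InputSource(new StringReader(itemXml))));
                    count++;
                }
            }
        }
        return count;
    }
}
```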

The advantage of this method is that you would parse the file quickly without any memory issues, and still have the convenience of a DOM tree. The drawback is that you would not be doing real XML parsing: if the XML is not exactly what you expect (is <item/> possible?), your program might crash.

The problem here is that you need to treat some XML elements (the ones inside the items) as if they were not XML elements when you first parse the file. If you could find another way to do that, you could use SAX to parse the file, get the item content as strings in a safe way, and parse the items with a DOM parser as described above.

I guess another option would be to use SAX or StAX and create DOM trees for the items based on the related events. But it might be complex if there are many elements in the language.
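As a sketch of this last option, the DOM tree can be built generically from StAX events, so the code does not need to know the element vocabulary. The class name StaxDomBuilder and the method build are hypothetical. The sketch ignores namespaces, comments, and processing instructions, and it expects the reader to be positioned on the start tag of the element to copy.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class StaxDomBuilder {

    /**
     * Builds a DOM document from the element the reader is positioned on.
     * On return the reader is on that element's matching end tag.
     */
    static Document build(XMLStreamReader reader, DocumentBuilder builder) throws XMLStreamException {
        if (reader.getEventType() != XMLStreamConstants.START_ELEMENT) {
            throw new IllegalStateException("Reader must be positioned on a start tag");
        }
        Document doc = builder.newDocument();
        Node current = doc;   // node that receives the next child
        int depth = 0;        // nesting level relative to the starting element
        int event = reader.getEventType();
        while (true) {
            switch (event) {
                case XMLStreamConstants.START_ELEMENT: {
                    Element element = doc.createElement(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        element.setAttribute(reader.getAttributeLocalName(i),
                                reader.getAttributeValue(i));
                    }
                    current.appendChild(element);
                    current = element;
                    depth++;
                    break;
                }
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    current.appendChild(doc.createTextNode(reader.getText()));
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    current = current.getParentNode();
                    depth--;
                    if (depth == 0) {
                        return doc;   // the starting element has been closed
                    }
                    break;
                default:
                    break;            // other event types are skipped in this sketch
            }
            event = reader.next();
        }
    }
}
```

In a StAX loop over the file, build would be called each time the reader reaches the start tag of an item, reusing the same DocumentBuilder for every call.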



Answer 4:

Using DOM, I have an efficient way of parsing XML. I wrote this DOM parser myself; it uses recursion to walk your XML without needing to know any of the tag names. It gives you each node's text content, if there is any, in document order. You can uncomment the commented-out section in the following code to also print the node name. Hope it helps.

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class RecDOMP {

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false);
        DocumentBuilder db = dbf.newDocumentBuilder();

        // Replace the following path with your input XML path.
        Document doc = db.parse(new File("D:\\ambuj\\input.xml"));

        // Replace the following path with your output file path.
        File outputFile = new File("D:\\ambuj\\outapip1.txt");

        // try-with-resources creates the file if needed and closes the writer,
        // even when an exception is thrown while writing.
        try (BufferedWriter bwriter =
                Files.newBufferedWriter(outputFile.toPath(), StandardCharsets.UTF_8)) {
            visitRecursively(doc, bwriter);
        }

        System.out.println("Done");
    }

    // Walks the tree depth-first and writes each non-blank text node on its own line.
    public static void visitRecursively(Node node, BufferedWriter bw) throws IOException {
        NodeList list = node.getChildNodes();
        for (int i = 0; i < list.getLength(); i++) {
            Node childNode = list.item(i);
            if (childNode.getNodeType() == Node.TEXT_NODE) {
                // Uncomment to print the node name and type as well:
                // System.out.println("Found Node: " + childNode.getNodeName()
                //         + " - with value: " + childNode.getNodeValue()
                //         + " Node type:" + childNode.getNodeType());

                // Note: this strips all whitespace, including spaces inside the text.
                String nodeValue = childNode.getNodeValue().replaceAll("\\s", "");
                if (!nodeValue.isEmpty()) {
                    System.out.println(nodeValue);
                    bw.write(nodeValue);
                    bw.newLine();
                }
            }
            visitRecursively(childNode, bw);
        }
    }
}