I need to process a huge XML file (4 GB). I am using dom4j with SAX, but I wrote my own DefaultElementHandler. The code skeleton is below:
SAXParserFactory sf = SAXParserFactory.newInstance();
SAXParser sax = sf.newSAXParser();
sax.parse("english.xml", new DefaultElementHandler("page") {
    public void processElement(Element element) {
        // process the element
    }
});
I thought I was processing the huge file "page" by "page", but apparently not, since I always get an out-of-memory error. Did I miss anything important? Thanks. I am new to XML processing.
Your DefaultElementHandler implementation looks confused to me. It looks like everything is piling up in sBuilder, and it never gets cleared until the parser finds the end of the root element or, more likely, runs out of memory.
How you read the element text depends on what kind of XML you need to parse. Each element can have text, and that text can be interspersed with child elements. Generally, there is the kind of XML you see in web services and config files, where all the element text is in the leaf elements, and then there are cases like XHTML, where text and child elements are interleaved (mixed content).
If the schema of your XML puts all the text in the leaf elements, then you can start buffering text in startElement, use the accumulated text in endElement, and then clear the buffer.
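A minimal sketch of that buffering pattern, assuming text only ever appears in leaf elements. The class name LeafTextHandler, the "name=text" output format, and the parse helper are made up for illustration; they are not from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records "name=text" for every leaf element.
class LeafTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean leaf;                        // stays true until a child element closes
    final List<String> leaves = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0);   // start buffering afresh for this element
        leaf = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);   // may be called several times per text node
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (leaf) {
            leaves.add(qName + "=" + text.toString().trim());
        }
        text.setLength(0);   // clear the buffer so nothing piles up
        leaf = false;        // the enclosing element has a child, so it is not a leaf
    }

    // Convenience for small inputs only; a large file would be passed as a File or stream.
    static List<String> parse(String xml) {
        try {
            LeafTextHandler handler = new LeafTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.leaves;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The buffer never holds more than one leaf's text at a time. The list is only there to make the result visible; for a file of several gigabytes you would process each value in endElement and discard it instead of collecting everything.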
Here's a good article on SAX: http://www.javaworld.com/javaworld/jw-08-2000/jw-0804-sax.html
Well, you don't really process XML by the page. However, if you extend XMLFilterImpl instead of using the DefaultElementHandler (whatever that is), you can simply process the XML elements as they come. You will be streaming, so as a practical matter the entire document is never in memory at once.
You will essentially get called for each element: at the start of the element (which also delivers its attributes), for the text within, and then at the end of the element (look at the methods in the ContentHandler interface). Based on these calls you do your processing; you will probably need some data structures in which you accumulate the elements inside your "page" element. Also note that there is no guarantee you will get only one call for the text (that's up to the parser).
Does this help make it clearer?
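Here is a rough sketch of what extending XMLFilterImpl could look like. The class name PageCountingFilter and the count helper are invented for illustration, and it assumes the default (non-namespace-aware) parser, so qName is the plain tag name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: counts <page> elements as they stream past.
class PageCountingFilter extends XMLFilterImpl {
    long pages;

    PageCountingFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("page".equals(qName)) {
            pages++;   // only a counter is kept, never the element itself
        }
        super.startElement(uri, localName, qName, atts);   // forward to any downstream handler
    }

    // Wires the filter to a parser and runs it over the given input.
    static long count(InputSource in) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            PageCountingFilter filter = new PageCountingFilter(reader);
            filter.parse(in);   // the filter installs itself as the reader's handler
            return filter.pages;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For the real file you would pass something like new InputSource(new FileInputStream("english.xml")). Memory use stays flat because nothing from earlier elements is retained.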
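To make the "accumulate inside page" idea concrete, here is a small sketch. It assumes each "page" contains flat child elements with unique names. PageRecordHandler, the Consumer callback, and parseAll are hypothetical names, not anything from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: builds one small map per <page> and hands it to a callback.
class PageRecordHandler extends DefaultHandler {
    private final Consumer<Map<String, String>> sink;
    private final StringBuilder text = new StringBuilder();
    private Map<String, String> page;   // null while outside a <page>

    PageRecordHandler(Consumer<Map<String, String>> sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("page".equals(qName)) {
            page = new LinkedHashMap<>();
        }
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (page != null) {
            text.append(ch, start, length);   // append: one text node may arrive in pieces
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (page == null) {
            return;
        }
        if ("page".equals(qName)) {
            sink.accept(page);   // hand the finished page over, then drop the reference
            page = null;
        } else {
            String value = text.toString().trim();
            if (!value.isEmpty()) {
                page.put(qName, value);
            }
        }
        text.setLength(0);
    }

    // Demonstration helper: collects every page, which only makes sense for small inputs.
    static List<Map<String, String>> parseAll(String xml) {
        List<Map<String, String>> out = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new PageRecordHandler(out::add));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out;
    }
}
```

With a large file the callback would process each map and let it go, so at most one page's data is alive at any time.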
I think it only reads the content within the element; I just followed an example I found online...
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public abstract class DefaultElementHandler extends DefaultHandler {
    private boolean begin;            // true while inside a tagName element
    private String tagName;
    private StringBuilder sBuilder;   // re-serialized XML of the current tagName element

    public DefaultElementHandler(String tagName) {
        this.tagName = tagName;
        this.begin = false;
        this.sBuilder = new StringBuilder();
    }

    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (qName.equals(tagName) || begin) {
            // Write the start tag and its attributes back out as text.
            sBuilder.append("<");
            sBuilder.append(qName);
            sBuilder.append(" ");
            int attrCount = attributes.getLength();
            for (int i = 0; i < attrCount; i++) {
                sBuilder.append(attributes.getQName(i));
                sBuilder.append("=\"");
                sBuilder.append(attributes.getValue(i));
                sBuilder.append("\" ");
            }
            sBuilder.append(">");
            begin = true;
        }
    }

    public void characters(char[] ch, int start, int length) throws SAXException {
        // convertSpecialChar(char) is a helper that is not shown in this post.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append(convertSpecialChar(ch[start + i]));
        }
        String text = sb.toString().trim();
        //String text = new String(convertSpecialChar(ch), start, length);
        if (text.trim().equals("")) return;
        if (begin) sBuilder.append(text);
    }

    public void endElement(String uri, String localName, String qName) throws SAXException {
        String stag = "</" + tagName + ">";
        String ntag = "</" + qName + ">";
        if (stag.equals(ntag) || begin) {
            sBuilder.append(ntag);
            if (stag.equals(ntag)) {
                // End of a tagName element: re-parse the buffered text with dom4j.
                begin = false;
                try {
                    Document doc = DocumentHelper.parseText(sBuilder.toString());
                    Element element = doc.getRootElement();
                    this.processElement(element);
                } catch (DocumentException e) {
                    e.printStackTrace();
                    System.exit(1);
                }
                sBuilder.setLength(0);
            }
        }
    }

    // Called once per complete tagName element; implemented by the anonymous subclass at the call site.
    public abstract void processElement(Element element);
}