Java XML Parser for huge files

I need a xml parser to parse a file that is approximately 1.8 gb.
So the parser should not load all the file to memory.

Any suggestions?

标签： java xml parsing

9条回答

等我变得足够好

2楼-- · 2019-01-04 14:35

Use a SAX based parser that presents you with the contents of the document in a stream of events.

0人赞添加讨论(0) 举报

我命由我不由天

3楼-- · 2019-01-04 14:37

StAX API is easier to deal with compared to SAX. Here is a short tutorial

0人赞添加讨论(0) 举报

Viruses.

4楼-- · 2019-01-04 14:45

+1 for StaX. It's easier to use than SaX because you don't need to write callbacks (you essentially just loop over all elements of the while until you're done) and it has (AFAIK) no limit as to the size of the files it can process.

0人赞添加讨论(0) 举报

Rolldiameter

5楼-- · 2019-01-04 14:47

Aside the recommended SAX parsing, you could use the StAX API (kind of a SAX evolution), included in the JDK (package javax.xml.stream ).

StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html

0人赞添加讨论(0) 举报

疯言疯语

6楼-- · 2019-01-04 14:49

Stream the file into a SAX parser and read it into memory in chunks.

SAX gives you a lot of control and being event-driven makes sense. The api is a little hard to get a grip on, you have to pay attention to some things like when the characters() method is called, but the basic idea is you write a content handler that gets called when the start and end of each xml element is read. So you can keep track of the current xpath in the document, identify which paths have which data you're interested in, and identify which path marks the end of a chunk that you want to save or hand off or otherwise process.

0人赞添加讨论(0) 举报

该账号已被封号

7楼-- · 2019-01-04 14:49

I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but wasn't important for the algorithm).

FIrstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4GB and I had an 8GB machine so I figured maybe 3GB of the file was just text, and java.lang.String would probably need 6GB for those text using its UTF-16.

If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark+sweep garbage collection will result in the pages getting accessed in a random-order manner and also objects getting moved from one object pool to another, which basically kills the machine.

So I decided to write all my strings out to disk in a file (the FS can obviously handle sequential-write of the 3GB just fine, and when reading it in the OS will use available memory for a file-system cache; there might still be random-access reads but fewer than a GC in java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.

StringsFile file = new StringsFile();
StringInFile str = file.newString("abc");        // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file

0人赞添加讨论(0) 举报

1 2 下一页

Java XML Parser for huge files

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间