I need to solve a problem that involves parsing a huge file, 3 GB or larger. The file is structured as a pseudo-XML file like this:
<docFileNo_1>
<otherItems></otherItems>
<html>
<div=XXXpostag>
</html>
</docFileNo>
... others doc...
<docFileNo_N>
<otherItems></otherItems>
<html>
<div=XXXpostag>
</html>
</docFileNo>
Searching the net, I have read about people who ran into problems handling large files, and the suggestion was to map the file with NIO. But I think that solution is too expensive and could end up throwing an exception. So I think my problem comes down to two doubts:
- How to read the 3 GB text file in a time-efficient way
- How to efficiently parse the HTML extracted from each docFileNoXX block, and apply rules to the HTML tags to extract the post content.
So, I have tried to address the first question this way:
- _reader = new BufferedReader(new FileReader(filePath)) // create a buffered reader on the file
- _currentLine = _reader.readLine(); // iterate over the file, reading it line by line
- For every line, I append the line to a String variable until I encounter the tag
- then with JSOUP and a CSS filter for the post I extract the content and write it to a file.
Extracting 25 MB takes about 88 seconds on average, so I would like to improve its performance.
How can I speed up my extraction?
For large XML files it is best to use a SAX-style parser; these don't attempt to build a document object model in memory for the whole XML file. I wouldn't try to read the XML file line by line; I'd call an appropriate method in the SAX implementation. Oracle has a tutorial on SAX.
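As a rough illustration of the SAX style, here is a minimal sketch using the JDK's built-in parser. The element name `post` is a hypothetical placeholder, not something from the question. Note also that SAX requires well-formed XML, and the sample in the question (`<div=XXXpostag>`, mismatched `</docFileNo>`) would likely be rejected as is, so this only applies if the real data is valid XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxSketch {

    /** Streams through the input and collects the text of every <post> element (hypothetical name). */
    static List<String> extractPosts(InputSource source) throws Exception {
        List<String> posts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside a <post>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("post".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one element, so append rather than assign.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("post".equals(qName) && current != null) {
                    posts.add(current.toString());
                    current = null;
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(source, handler);
        return posts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<docs><doc><post>first</post></doc><doc><post>second</post></doc></docs>";
        System.out.println(extractPosts(new InputSource(new StringReader(xml))));
    }
}
```

For a real file you would pass an `InputSource` wrapping a file stream instead of a `StringReader`; only the current element's text is held in memory at any time.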
If your problem is the disk I/O part, you may be able to speed up the process by using a BufferedInputStream with a large buffer, e.g. 256 KB.
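A minimal sketch of what such a reader could look like. The class name, the `countLines` helper and the UTF-8 charset are assumptions for illustration; use whatever encoding the file really has.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class LargeBufferReader {
    static final int BUFFER_SIZE = 256 * 1024; // 256 KB

    /** Opens the file for line-by-line reading through a 256 KB byte buffer. */
    static BufferedReader open(String filePath) throws IOException {
        return new BufferedReader(
                new InputStreamReader(
                        new BufferedInputStream(new FileInputStream(filePath), BUFFER_SIZE),
                        StandardCharsets.UTF_8)); // assumed encoding
    }

    /** Example use: walks the whole file once and counts its lines. */
    static long countLines(String filePath) throws IOException {
        long lines = 0;
        try (BufferedReader reader = open(filePath)) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }
}
```

Timing `countLines` on the real file is a quick way to check whether the reading itself is the bottleneck, before touching the parsing code.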
If the problem is the CPU and you have a multi-core machine you can try to move work into a separate thread.
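One way this could look, as a sketch only: the reading stays on the calling thread and each extracted entry is handed to a fixed thread pool. `process` is a hypothetical stand-in for the CPU-heavy step (e.g. the HTML parsing), and the pool size is up to you.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelExtract {

    /** Hypothetical placeholder for the expensive per-entry work. */
    static String process(String rawEntry) {
        return rawEntry.trim().toUpperCase();
    }

    /** Submits every entry to a worker pool and returns the results in input order. */
    static List<String> processAll(Iterable<String> entries, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String entry : entries) {
                Callable<String> task = () -> process(entry);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // blocks until that entry is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

This version queues every entry before collecting any result, which is fine for a test but not for 3 GB of input; for the real file you would want to bound the number of pending tasks or write results out as they complete.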
Whatever you do, don't build up the text by repeatedly concatenating to a String in a loop; use a StringBuilder instead.
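To make the difference concrete, here is a sketch of both variants side by side. The method names are mine, not from the question. The first one copies everything accumulated so far on every iteration, so the cost grows quadratically with the amount of text; the second appends in place.

```java
import java.io.BufferedReader;
import java.io.IOException;

class AppendLines {

    /** Anti-pattern: each += allocates a new String and copies all previous content. */
    static String slowJoin(BufferedReader reader) throws IOException {
        String all = "";
        String line;
        while ((line = reader.readLine()) != null) {
            all += line + "\n";
        }
        return all;
    }

    /** Preferred: the StringBuilder grows its internal buffer and appends in place. */
    static String fastJoin(BufferedReader reader) throws IOException {
        StringBuilder all = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            all.append(line).append('\n');
        }
        return all.toString();
    }
}
```

Both return the same text; only the amount of copying differs, which matters a lot once an entry spans many lines.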
Further, consider walking through the file and creating a map with only the interesting parts. I assume you don't have XML but something which only looks a bit like it, and that the example you gave is a fair representation of the content.
The map contains only entry-sized strings which makes them a bit easier to handle. I think you'd need to adapt it to the true data, but this is something which you could test in about half an hour.
The key is entryData. It is not only the StringBuilder in which the data of one entry is built; when it is non-null it also indicates that we saw a start-of-entry marker (the div), and when it is null we saw the end marker (</html>), indicating that the following lines need not be stored. I assumed you want to keep the doc number, and that XXXpostag is constant.
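A sketch of that logic, under the assumptions stated above: the markers are copied from the sample in the question, the class and method names are hypothetical, and I chose to keep the div line itself in the stored entry so it can be parsed later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

class EntryMapper {
    // Markers taken from the sample in the question; adapt them to the real data.
    static final String DOC_PREFIX = "<docFileNo_";
    static final String START_MARKER = "<div=XXXpostag>";
    static final String END_MARKER = "</html>";

    /** Maps each doc number to the lines from its start marker up to (not including) </html>. */
    static Map<String, String> collect(BufferedReader reader) throws IOException {
        Map<String, String> entries = new LinkedHashMap<>();
        String docNo = null;
        StringBuilder entryData = null; // non-null means "inside an entry"
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.startsWith(DOC_PREFIX) && trimmed.endsWith(">")) {
                // "<docFileNo_17>" yields "17"
                docNo = trimmed.substring(DOC_PREFIX.length(), trimmed.length() - 1);
            } else if (trimmed.startsWith(START_MARKER)) {
                entryData = new StringBuilder(trimmed).append('\n');
            } else if (trimmed.startsWith(END_MARKER)) {
                if (entryData != null && docNo != null) {
                    entries.put(docNo, entryData.toString());
                }
                entryData = null; // stop storing until the next start marker
            } else if (entryData != null) {
                entryData.append(line).append('\n');
            }
        }
        return entries;
    }
}
```

Each map value is then a small string that can be handed to the HTML parser on its own, instead of parsing one huge accumulated document.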
An alternative implementation of this logic could be made using the Scanner class.
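I haven't worked this variant out in full, but one possible shape is to let the Scanner split the input on the closing `</docFileNo>` tag, so that each token is one whole doc block. This is a sketch under the same assumption that the sample reflects the real data; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class ScannerSplit {

    /** Returns one string per doc block, using the closing tag as the token delimiter. */
    static List<String> records(Readable source) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            // Pattern.quote makes the tag a literal delimiter rather than a regex.
            scanner.useDelimiter(Pattern.quote("</docFileNo>"));
            while (scanner.hasNext()) {
                String record = scanner.next().trim();
                if (!record.isEmpty()) { // skip whitespace after the last closing tag
                    out.add(record);
                }
            }
        }
        return out;
    }
}
```

Keep in mind that Scanner swallows I/O errors (check `ioException()` if that matters), and that for the full file you would process each record as it is read rather than collecting them all in a list.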