Parsing a big, not well-formed file with Java

Posted 2019-04-02 03:13

I have to solve a problem that amounts to parsing a huge file, 3 GB or larger. The file is structured as a pseudo-XML file, like:

<docFileNo_1>
<otherItems></otherItems>

<html>
<div=XXXpostag>
</html>

</docFileNo>
   ... other docs ... 
<docFileNo_N>
<otherItems></otherItems>

<html>
<div=XXXpostag>
</html>

</docFileNo>

Searching the net, I have read about people who encountered problems managing such files; the suggestion to me was to map the file with NIO. But I think that solution is too expensive and could cause an exception to be thrown. So I think my problem comes down to resolving 2 doubts:

  1. How to read the 3 GB text file efficiently in time
  2. How to efficiently parse the HTML extracted from each docFileNoxx, and apply rules to the HTML's tags to extract the content of the post.

So... I have tried to resolve the first question this way:

  1. _reader = new BufferedReader(new FileReader(filePath)) // create a buffered reader for the file
  2. _currentLine = _reader.readLine(); // iterate over the file, reading it line by line
  3. For every line, append the line to a String variable until the closing tag is encountered
  4. then with Jsoup and a CSS filter, extract the content and write it to a file (see the sketch after this list).
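
A minimal sketch of these four steps, assuming each document ends with a </docFileNo> line and using a placeholder CSS query ("div"); both would need adapting to the real data:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

static void extract(String filePath) throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
        StringBuilder chunk = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            chunk.append(line).append('\n');
            if (line.startsWith("</docFileNo")) {         // one document is complete
                Document doc = Jsoup.parse(chunk.toString());
                String post = doc.select("div").text();   // placeholder CSS filter
                // ... write `post` to the output file ...
                chunk.setLength(0);                       // reuse the builder
            }
        }
    }
}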

Well, the extraction of 25 MB takes about 88 seconds on average... so I would like to speed it up.

How could I speed up my extraction?

3 Answers
beautiful° · 2019-04-02 03:34

For large XML files it is best to use a SAX-style parser; these don't attempt to build a document object model in memory for the whole XML file. I wouldn't try to read the XML file line by line; I'd call an appropriate method in the SAX implementation. Oracle has a tutorial.
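
As a rough illustration, here is a minimal SAX sketch (note that SAX requires well-formed XML, and the sample above is not well-formed, so the input would need cleanup first; the docFileNo check and the method name are assumptions):

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

static void parseWithSax(String filePath) throws Exception {
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new File(filePath), new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                text.setLength(0);                        // reset the buffer for each element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);           // may arrive in several calls
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.startsWith("docFileNo")) {
                    // one document finished; process the collected text here
                }
            }
        });
}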

[account banned] · 2019-04-02 03:44

You may be able to speed up the process, if your problem is the disk I/O part, by using a BufferedInputStream with a large buffer, e.g. 256 KB as in the following example:

InputStream in = new BufferedInputStream(new FileInputStream(filePath), 256 * 1024);
BufferedReader reader = new BufferedReader(new InputStreamReader(in));

If the problem is the CPU and you have a multi-core machine, you can try to move work into a separate thread, as sketched below.
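
A minimal sketch of that idea, with one thread doing the I/O and another doing the parsing (the </docFileNo> end marker, the queue size, the sentinel value, and the class name are assumptions to be adapted):

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TwoThreadExtractor {
    private static final String POISON = "\u0000";        // sentinel: tells the parser to stop

    public static void main(String[] args) throws Exception {
        String filePath = args[0];
        BlockingQueue<String> chunks = new ArrayBlockingQueue<>(64);

        Thread parser = new Thread(() -> {
            try {
                for (String chunk; !(chunk = chunks.take()).equals(POISON); ) {
                    // CPU-bound work goes here, e.g. Jsoup.parse(chunk)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parser.start();

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new BufferedInputStream(new FileInputStream(filePath), 256 * 1024)))) {
            StringBuilder chunk = new StringBuilder();
            for (String line; (line = reader.readLine()) != null; ) {
                chunk.append(line).append('\n');
                if (line.startsWith("</docFileNo")) {     // assumed end-of-document marker
                    chunks.put(chunk.toString());         // hand the chunk to the parser thread
                    chunk.setLength(0);
                }
            }
        }
        chunks.put(POISON);
        parser.join();
    }
}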

趁早两清 · 2019-04-02 03:58

Whatever you do, don't build the text by concatenating Strings:

String data = "";
for (String line; (line = reader.readLine()) != null; ) {
    data += line; // copies the whole accumulated string on every iteration: quadratic cost
}

but use a StringBuilder:

StringBuilder data = new StringBuilder();
for (String line; (line = reader.readLine()) != null; ) {
    data.append(line);
}
return data.toString();

Further, consider walking through the file and creating a map with only the interesting parts. I assume you don't have XML but something which only looks a bit like it, and that the example you gave is a fair representation of the content.

Map<String, String> entries = new HashMap<>(1000);
StringBuilder entryData = null;
String docFileNo = null;
for (String line; (line = reader.readLine()) != null; ) {
    if (line.startsWith("<docFileNo")) {
        docFileNo = extractNumber(line); // hypothetical helper: pull the number out of "<docFileNo_x>"
    } else if (line.startsWith("<div=XXXpostag>")) {
        // content of this entry starts here
        entryData = new StringBuilder();
    } else if (line.startsWith("</html>")) {
        // content of this entry ends here, so store it and indicate that
        // the entry is finished by setting entryData to null
        entries.put(docFileNo, entryData.toString());
        entryData = null;
    } else if (entryData != null) {
        // we're in an entry, as entryData is not null, so store the line
        entryData.append(line);
    }
}

The map contains only entry-sized strings, which makes them a bit easier to handle. I think you'd need to adapt this to the true data, but it is something you could test in about half an hour.

The key is entryData: it is not only the StringBuilder in which the data of one entry is built; when non-null it also indicates that we saw a start-of-entry marker (the div), and when null that we saw the end marker (</html>), meaning the following lines need not be stored.

I assumed you want to keep the doc number, and that the XXXpostag is constant.

An alternative implementation of this logic could be made using the Scanner class.
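
For example, a minimal sketch using Scanner's delimiter support (assuming, as in the sample, that </html> reliably ends each entry and that filePath names the input file):

import java.io.File;
import java.util.Scanner;

try (Scanner scanner = new Scanner(new File(filePath), "UTF-8")) {
    scanner.useDelimiter("</html>");      // each token runs up to the next end marker
    while (scanner.hasNext()) {
        String chunk = scanner.next();    // one <docFileNo_x> ... <div=XXXpostag> block
        // extract the doc number and the post content from `chunk` as above
    }
}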
