Why can't I seem to read an entire compressed

Posted 2019-08-19 03:20

Question:

I'm trying to parse Wiktionary dumps on the fly, directly from the URL, in Java. The Wiki dumps are distributed as compressed BZIP2 files, and I am using the following approach to attempt to parse them:

String fileURL = "https://dumps.wikimedia.org/cswiktionary/20171120/cswiktionary-20171120-pages-articles-multistream.xml.bz2";
URL bz2 = new URL(fileURL);
BufferedInputStream bis = new BufferedInputStream(bz2.openStream());
CompressorInputStream input = new CompressorStreamFactory().createCompressorInputStream(bis);
BufferedReader br2 = new BufferedReader(new InputStreamReader(input));
System.out.println(br2.lines().count());

However, the reported line count is only 36, which is a tiny fraction of the file, given that the compressed dump is over 20MB. When I tried printing the stream line by line instead, only a few lines of XML came out:

String line = br2.readLine();
while(line != null) {
  System.out.println(line);
  line = br2.readLine();
}

Is there something I am missing here? I copied my implementation almost line-for-line from other code I found online, which others said worked for them. Why isn't the entire stream being read? Thanks in advance.

Answer 1:

So as it turns out, I had just missed something obvious. Wiktionary BZIP2 dumps are explicitly multistream (it even says so in the filename): the file is several independent BZIP2 streams concatenated together. The vanilla Commons Compress classes, used the way I was using them, read only the first stream and then report end of input, which is why I got just 36 lines. To read a multistream file you need a reader that keeps going past the end of each stream. I happened across the following implementation, which worked for me:

https://chaosinmotion.blog/2011/07/29/and-another-curiosity-multi-stream-bzip2-files/
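For anyone who would rather not write a reader by hand: Commons Compress itself can handle concatenated streams if asked to. Below is a minimal sketch, assuming a reasonably recent Commons Compress on the classpath, that uses the two-argument `BZip2CompressorInputStream(InputStream, boolean)` constructor with `decompressConcatenated` set to `true`. The class name `MultistreamBzip2` and the helper `countLines` are made up for illustration, and the URL is the one from the question, which may no longer be available.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

// Hypothetical class name, used only for this example.
public class MultistreamBzip2 {

    // Hypothetical helper: counts the lines across ALL concatenated bzip2
    // streams in `raw`, not just the first one.
    static long countLines(InputStream raw) throws IOException {
        // The second constructor argument (decompressConcatenated = true)
        // makes the decompressor continue into the next stream instead of
        // signalling end of input after the first.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new BZip2CompressorInputStream(new BufferedInputStream(raw), true),
                StandardCharsets.UTF_8))) {
            return reader.lines().count();
        }
    }

    public static void main(String[] args) throws IOException {
        // URL taken from the question; old dumps may have been removed.
        URL dump = new URL("https://dumps.wikimedia.org/cswiktionary/20171120/"
                + "cswiktionary-20171120-pages-articles-multistream.xml.bz2");
        try (InputStream in = dump.openStream()) {
            System.out.println(countLines(in));
        }
    }
}
```

If you want to keep the auto-detecting factory from the question, I believe newer versions also offer a `CompressorStreamFactory(true)` constructor with the same effect, but check the Javadoc for the version you are on. The charset is set to UTF-8 explicitly because the dumps are XML and the platform default may differ.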

Hope this helps someone in the future :)