I'm trying to parse Wiktionary dumps on the fly, directly from the URL, in Java. The dumps are distributed as bzip2-compressed XML files, and I am using the following approach (with Apache Commons Compress) to attempt to parse them:
String fileURL = "https://dumps.wikimedia.org/cswiktionary/20171120/cswiktionary-20171120-pages-articles-multistream.xml.bz2";
URL bz2 = new URL(fileURL);
BufferedInputStream bis = new BufferedInputStream(bz2.openStream());
CompressorInputStream input = new CompressorStreamFactory().createCompressorInputStream(bis);
BufferedReader br2 = new BufferedReader(new InputStreamReader(input));
System.out.println(br2.lines().count());
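For reference, the same decompress-and-count pipeline behaves as expected against a single in-memory bzip2 stream, so the factory/reader wiring itself seems fine. This is a minimal sketch assuming Commons Compress is on the classpath; the class name `Bz2Check` and the sample text are mine:

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.commons.compress.compressors.CompressorInputStream;
import org.apache.commons.compress.compressors.CompressorStreamFactory;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

public class Bz2Check {
    public static void main(String[] args) throws Exception {
        // Compress three lines of text into an in-memory bzip2 stream.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (BZip2CompressorOutputStream bz = new BZip2CompressorOutputStream(baos)) {
            bz.write("line1\nline2\nline3\n".getBytes(StandardCharsets.UTF_8));
        }

        // Decompress with the same factory-based pipeline used for the dump.
        BufferedInputStream bis =
                new BufferedInputStream(new ByteArrayInputStream(baos.toByteArray()));
        CompressorInputStream input =
                new CompressorStreamFactory().createCompressorInputStream(bis);
        try (BufferedReader br =
                new BufferedReader(new InputStreamReader(input, StandardCharsets.UTF_8))) {
            System.out.println(br.lines().count()); // prints 3
        }
    }
}
```

(The real code reads from `bz2.openStream()` instead of the in-memory buffer, but the decompression path is otherwise identical.)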
However, the reported line count is only 36, a tiny fraction of the file, which is over 20 MB even compressed. Printing the stream line by line confirms that only a few lines of XML are actually read:
String line = br2.readLine();
while (line != null) {
    System.out.println(line);
    line = br2.readLine();
}
Is there something I am missing here? I copied my implementation almost line for line from code I found online, which others claimed worked. Why isn't the entire stream being read? Thanks in advance.