Question
See updated question in edit section below
I'm trying to decompress large (~300M) gzipped files from Amazon S3 on the fly using GZIPInputStream, but it only outputs a portion of the file; however, if I download the file to the filesystem before decompression, GZIPInputStream decompresses the entire file.
How can I get GZIPInputStream to decompress the entire HTTPInputStream and not just the first part of it?
What I've Tried
See update in the edit section below.
I suspected an HTTP problem, except that no exceptions are ever thrown, GZIPInputStream returns a fairly consistent chunk of the file each time, and, as far as I can tell, it always breaks on a WET record boundary, although the boundary it picks is different for each URL. That is very strange, since everything is being treated as a binary stream; no parsing of the WET records in the file is happening at all.
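To help isolate the layer at fault, one check worth running (not part of the original test; the class and variable names here are mine) is to count the raw compressed bytes delivered by the HTTP stream and compare them against the Content-Length header. If the counts match, the HTTP layer is delivering the whole file and the truncation happens during decompression:

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class RawByteCount {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URLConnection conn = url.openConnection();
        long expected = conn.getContentLengthLong(); // compressed size per the response header
        long actual = 0;
        byte[] buffer = new byte[8192];
        try (InputStream in = conn.getInputStream()) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                actual += n;
            }
        }
        // If these match, the connection is not being truncated.
        System.out.println("Content-Length: " + expected + ", bytes received: " + actual);
    }
}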
The closest question I could find is "GZIPInputStream is prematurely closed when reading from s3". The answer to that question was that some GZIP files are actually multiple appended GZIP files, and GZIPInputStream doesn't handle that well. However, if that is the case here, why would GZIPInputStream work fine on a local copy of the file?
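One way to test that multiple-members theory directly against the streamed file, sketched here under the assumption that the Apache Commons Compress library is on the classpath, is its GzipCompressorInputStream, whose decompressConcatenated flag makes it keep reading across member boundaries instead of stopping at the first trailer:

import java.io.BufferedInputStream;
import java.net.URL;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class MultiMemberTest {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        long total = 0;
        byte[] buffer = new byte[8192];
        // decompressConcatenated=true: continue after each gzip member ends
        try (GzipCompressorInputStream gzin = new GzipCompressorInputStream(
                new BufferedInputStream(url.openStream()), true)) {
            int n;
            while ((n = gzin.read(buffer)) != -1) {
                total += n;
            }
        }
        System.out.println("Decompressed " + total + " bytes");
    }
}

If this reads the full ~449M over HTTP while GZIPInputStream stops early, that would point at multi-member handling rather than the connection.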
Demonstration Code and Output
Below is a piece of sample code that demonstrates the problem I am seeing. I've tested it with Java 1.8.0_72 and 1.8.0_112 on two different Linux computers on two different networks with similar results. I expect the byte count from the decompressed HTTPInputStream to be identical to the byte count from the decompressed local copy of the file, but the decompressed HTTPInputStream is much smaller.
Output
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 87894 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile0.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 1772936 bytes from HTTP->GZIP
Read 451171329 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile40.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 89217 bytes from HTTP->GZIP
Read 453183600 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile500.wet
Sample Code
import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.nio.channels.*;

public class GZIPTest {
    public static void main(String[] args) throws Exception {
        // Our three test files from CommonCrawl
        URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");

        /*
         * Test the URLs and display the results
         */
        test(url0, "testfile0.wet");
        System.out.println("------");
        test(url40, "testfile40.wet");
        System.out.println("------");
        test(url500, "testfile500.wet");
    }

    public static void test(URL url, String testGZFileName) throws Exception {
        System.out.println("Testing URL " + url.toString());

        // First directly wrap the HTTPInputStream with GZIPInputStream
        // and count the number of bytes we read.
        // Go ahead and save the extracted stream to a file for further inspection.
        System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
        int bytesFromGZIPDirect = 0;
        URLConnection urlConnection = url.openConnection();
        FileOutputStream directGZIPOutStream = new FileOutputStream("./" + testGZFileName);

        // FIRST TEST - Decompress from HTTPInputStream
        GZIPInputStream gzipishttp = new GZIPInputStream(urlConnection.getInputStream());

        byte[] buffer = new byte[1024];
        int bytesRead = -1;
        while ((bytesRead = gzipishttp.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPDirect += bytesRead;
            directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
        }
        gzipishttp.close();
        directGZIPOutStream.close();

        // Now save the GZIPed file locally
        System.out.println("Testing saving to file before decompression");
        int bytesFromGZIPFile = 0;
        ReadableByteChannel rbc = Channels.newChannel(url.openStream());
        FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
        outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        outputStream.close();
        rbc.close(); // close the channel and its underlying stream as well

        // SECOND TEST - decompress from FileInputStream
        GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));
        buffer = new byte[1024];
        bytesRead = -1;
        while ((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPFile += bytesRead;
        }
        gzipis.close();

        // The Results - these numbers should match but they don't
        System.out.println("Read " + bytesFromGZIPDirect + " bytes from HTTP->GZIP");
        System.out.println("Read " + bytesFromGZIPFile + " bytes from HTTP->file->GZIP");
        System.out.println("Output from HTTP->GZIP saved to file " + testGZFileName);
    }
}
Edit
Closed Stream and associated Channel in demonstration code as per comment by @VGR.
UPDATE:
The problem does seem to be something specific to the file. I pulled the Common Crawl WET archive down locally (wget), uncompressed it (gunzip 1.8), recompressed it (gzip 1.8), re-uploaded it to S3, and the on-the-fly decompression then worked fine. You can reproduce the test by modifying the sample code above to include the following lines:
// Original file from CommonCrawl hosted on S3
URL originals3 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
// Recompressed file hosted on S3
URL rezippeds3 = new URL("https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
test(originals3, "originals3.txt");
test(rezippeds3, "rezippeds3.txt");
URL rezippeds3 points to the WET archive file that I downloaded, decompressed, recompressed, and re-uploaded to S3. You will see the following output:
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 7212400 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file originals3.txt
------
Testing URL https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 448974935 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file rezippeds3.txt
As you can see, once the file was recompressed I was able to stream it through GZIPInputStream and get the entire file. The original file still shows the usual premature end of decompression. When I downloaded and re-uploaded the WET file without recompressing it, I got the same incomplete streaming behavior, so it was definitely the recompression that fixed it. I also put both files, the original and the recompressed, onto a traditional Apache web server and was able to replicate the results, so S3 doesn't seem to have anything to do with the problem.
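Since gunzip followed by gzip rewrites the archive as a single member, a rough way to compare the member structure of the original and recompressed files is to scan the raw bytes for the gzip magic header (0x1f 0x8b followed by deflate method 0x08). This is only a heuristic of mine, since that three-byte sequence can also occur by chance inside compressed data, so treat the count as an upper bound. It can be pointed at the ./test.wet.gz left behind by the sample code:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class GzipMemberScan {
    public static void main(String[] args) throws Exception {
        // Pass the locally downloaded archive, e.g. ./test.wet.gz
        System.out.println(countMagicHeaders(args[0]) + " apparent gzip member headers");
    }

    static long countMagicHeaders(String path) throws Exception {
        long count = 0;
        int b1 = -1, b2 = -1;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int b;
            while ((b = in.read()) != -1) {
                // a gzip member header starts with 0x1f 0x8b and compression method 8
                if (b1 == 0x1f && b2 == 0x8b && b == 0x08) {
                    count++;
                }
                b1 = b2;
                b2 = b;
            }
        }
        return count;
    }
}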
So. I have a new question.
New Question
Why would a FileInputStream behave differently from an HTTPInputStream when reading the same content? If it is the exact same file, why does:
new GZIPInputStream(urlConnection.getInputStream());
behave any differently than
new GZIPInputStream(new FileInputStream("./test.wet.gz"));
?? Isn't an input stream just an input stream??
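One observable difference between the two streams, offered as a possible lead rather than a diagnosis: they answer InputStream.available() very differently. FileInputStream.available() estimates the bytes remaining in the file, while the stream returned by URLConnection typically reports only what is already buffered locally, which can be 0 even when more data is still in flight. A minimal sketch to compare the two, assuming the ./test.wet.gz from the sample code above exists:

import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URL;

public class AvailableTest {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        byte[] buffer = new byte[1024 * 1024];

        try (InputStream http = url.openConnection().getInputStream()) {
            http.read(buffer); // read some data, then ask what is "available"
            System.out.println("HTTP stream available(): " + http.available());
        }

        try (InputStream file = new FileInputStream("./test.wet.gz")) {
            file.read(buffer);
            System.out.println("File stream available(): " + file.available());
        }
    }
}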