BZip2 file read in Hadoop

Posted 2019-04-13 08:51

Question:

I heard that Hadoop can use multiple mappers to read different parts of a single bzip2 file in parallel to improve performance, but I could not find any samples after searching. I would appreciate it if anyone could point me to a relevant code snippet. Thanks.

BTW: does gzip have the same feature (multiple mappers processing different parts of one gzip file in parallel)?

Answer 1:

If you look at http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/30662, you will find that the bzip2 format is indeed splittable and that multiple mappers can work on one file. The patch was submitted at https://issues.apache.org/jira/browse/HADOOP-4012. However, it seems to be available only in Hadoop 0.21.0 and later.

In my experience, you do not need to do anything different to use this with bzip2. Hadoop should pick it up automatically, depending on your minimum split size.

bzip2 compresses data in blocks, so it is possible to decompress it block by block and send each block to a separate mapper. gzip, however, has no such independently decompressible blocks, so a single gzip file cannot be split across mappers.
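
To make the block structure a bit more concrete, here is a minimal Python sketch (not Hadoop code). It assumes, as I understand the format, that every bzip2 block begins with the 48-bit marker 0x314159265359 and that blocks are not byte-aligned, so the scan works on bits. The helper name `block_bit_offsets` is made up for this illustration. A splittable reader can in principle seek to an arbitrary offset and scan forward to the next marker; I have not checked how Hadoop's reader does this internally.

```python
import bz2
import os

# 48-bit marker that starts every bzip2 block (assumption: standard bzip2 format).
BLOCK_MAGIC = format(0x314159265359, "048b")


def block_bit_offsets(compressed: bytes) -> list:
    """Return the bit offset of every block marker found in a bzip2 stream."""
    # Render the whole stream as a bit string, keeping leading zero bits.
    bits = bin(int.from_bytes(compressed, "big"))[2:].zfill(len(compressed) * 8)
    offsets = []
    pos = bits.find(BLOCK_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = bits.find(BLOCK_MAGIC, pos + 1)
    return offsets


# Random bytes do not compress, so 350 KB at level 1 (about 100 KB of input
# per block) should span several blocks.
payload = os.urandom(350_000)
compressed = bz2.compress(payload, compresslevel=1)
offsets = block_bit_offsets(compressed)
print(len(offsets), "blocks; first marker at bit", offsets[0])
```

The first marker sits right after the 4-byte stream header, and the later ones generally fall at bit positions that are not multiples of 8, which is why a byte-wise search would miss them.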



Answer 2:

You can look at pbzip2 for an example of parallel bz2 compression and decompression.
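
As a rough illustration of the idea (a sketch, not pbzip2's actual code), the Python snippet below compresses fixed-size chunks as independent bzip2 streams in a thread pool and concatenates them, which is how I understand pbzip2 to lay out its output. The function name `parallel_bz2_compress` and the chunk size are my own choices for this example.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2_compress(data: bytes, chunk_size: int = 900_000, workers: int = 4) -> bytes:
    """Compress each chunk as its own bzip2 stream and concatenate the streams."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the streams stay in file order.
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


data = b"hadoop bzip2 split example\n" * 150_000
blob = parallel_bz2_compress(data, chunk_size=1_000_000)

# bz2.decompress accepts several concatenated streams and returns the joined data.
restored = bz2.decompress(blob)
print(len(data), "->", len(blob), "bytes, round trip ok:", restored == data)
```

Because every chunk is a complete stream, a reader that knows the stream boundaries could also decompress the chunks in parallel. The cost is a slightly worse ratio than compressing the whole input as one stream.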

There is a parallel gzip as well, pigz. It does parallel compression, but not parallel decompression, because the deflate format is not suited to parallel decompression. However, you can either a) prepare a special gzip stream with resets of the history, or b) build an index into a gzip file on a first pass. Either way, you can then read different parts in parallel, or have more efficient random access.
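
Option a) can be sketched in Python with zlib's `Z_FULL_FLUSH`, which flushes output to a byte boundary and resets the compression history. This is only a minimal sketch under my own assumptions: the function names, the 1 MiB chunk size, and the side list of offsets (a tiny stand-in for an index) are invented for the example and are not how pigz or any Hadoop codec does it.

```python
import gzip
import zlib
from concurrent.futures import ThreadPoolExecutor


def gzip_with_resets(data: bytes, chunk_size: int = 1 << 20):
    """Return (gzip_bytes, starts); starts[i] is the byte offset where chunk i begins."""
    comp = zlib.compressobj(wbits=31)  # 31 selects the gzip header and trailer
    out = bytearray()
    starts = []
    for i in range(0, len(data), chunk_size):
        starts.append(len(out))
        out += comp.compress(data[i:i + chunk_size])
        # Full flush: byte-align the output and forget the history window.
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush()  # final block plus gzip trailer
    return bytes(out), starts


def read_segment(blob: bytes, starts: list, index: int) -> bytes:
    """Decompress one chunk without touching the chunks before it."""
    begin = starts[index]
    end = starts[index + 1] if index + 1 < len(starts) else len(blob)
    # Segment 0 still carries the gzip header; later segments are raw deflate.
    wbits = 31 if index == 0 else -15
    return zlib.decompressobj(wbits=wbits).decompress(blob[begin:end])


def parallel_read(blob: bytes, starts: list, workers: int = 4) -> bytes:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda i: read_segment(blob, starts, i), range(len(starts)))
        return b"".join(parts)


data = bytes(range(256)) * 20_000
blob, starts = gzip_with_resets(data)
print("segments:", len(starts), "parallel read ok:", parallel_read(blob, starts) == data)
print("still a normal gzip file:", gzip.decompress(blob) == data)
```

The result is still an ordinary gzip stream, so regular tools can read it sequentially. The trade-offs are a somewhat lower compression ratio from the history resets, and the fact that segments read in raw mode skip the gzip CRC check.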