Hadoop, how to compress mapper output but not the

Posted 2019-01-30 08:27

Question:

I have a MapReduce Java program in which I want to compress only the mapper output, not the reducer output. I thought this would be possible by setting the properties listed below on the Configuration instance. However, when I run the job, the reducer output is still compressed: the generated file is part-r-00000.gz. Has anyone successfully compressed just the mapper output but not the reducer output? Is that even possible?

//Compress mapper output

conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString());
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

Answer 1:

With MR2, set the following instead:

conf.setBoolean("mapreduce.map.output.compress", true);
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);

For more details, refer: http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
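
Other answers here use the older MR1 property names, so it can help to see how they line up with the MR2 names above. The following is a minimal sketch of that mapping as a plain lookup table; the class and method names (`CompressionPropertyNames`, `toMr2`) are hypothetical and not part of Hadoop, and the table covers only the five compression keys discussed in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CompressionPropertyNames {
    // Deprecated MR1 name -> MR2 name for the compression keys in this thread.
    // Hypothetical helper for illustration; Hadoop itself is not required to run it.
    public static final Map<String, String> MR1_TO_MR2 = new LinkedHashMap<>();
    static {
        // Intermediate (map-side) output, i.e. the data shuffled to the reducers.
        MR1_TO_MR2.put("mapred.compress.map.output", "mapreduce.map.output.compress");
        MR1_TO_MR2.put("mapred.map.output.compression.codec", "mapreduce.map.output.compress.codec");
        // Final job output, i.e. the part-r-* files.
        MR1_TO_MR2.put("mapred.output.compress", "mapreduce.output.fileoutputformat.compress");
        MR1_TO_MR2.put("mapred.output.compression.type", "mapreduce.output.fileoutputformat.compress.type");
        MR1_TO_MR2.put("mapred.output.compression.codec", "mapreduce.output.fileoutputformat.compress.codec");
    }

    /** Returns the MR2 name for an MR1 key, or the key unchanged if it is not in the table. */
    public static String toMr2(String key) {
        return MR1_TO_MR2.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : MR1_TO_MR2.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

The first two keys affect only the data passed from mappers to reducers; the last three control the files the job finally writes, which is why setting them in the question produced part-r-00000.gz.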



Answer 2:

mapred.compress.map.output: This controls compression of the data passed between the mapper and the reducer. If you use the Snappy codec, this will most likely increase read/write speed and reduce network overhead. Don't worry about splitting here. These files are not stored in HDFS. They are temp files that exist only for the duration of the MapReduce job.

mapred.map.output.compression.codec: I would use Snappy.

mapred.output.compress: This boolean flag defines whether the whole MapReduce job will output compressed data. I would always set this to true as well: faster read/write speeds and less disk space used.

mapred.output.compression.type: I use BLOCK. With block compression the output stays splittable for every codec (gzip, Snappy, and bzip2), provided you write it in a splittable file format such as SequenceFile, RCFile, or Avro.

mapred.output.compression.codec: This is the compression codec for the final MapReduce job output. I mostly use one of these three: Snappy (fastest read/write, 2x-3x compression), gzip (normal read, fast write, 5x-8x compression), bzip2 (slow read/write, 8x-12x compression).

Also remember, when compressing mapred output, that because of splitting, compression will differ based on your sort order. The closer together similar data is, the better the compression.
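
The effect of sort order on compression can be seen without a cluster. The sketch below is an illustration only: it uses the JDK's `GZIPOutputStream` rather than a Hadoop codec, and the record layout (`user-N,click`) and class name are made up for the example. It compresses the same records once in random order and once sorted, then prints both sizes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class SortedCompressionDemo {
    /** Gzips the given lines in order and returns the compressed size in bytes. */
    public static int gzipSize(List<String> lines) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the stream writes the gzip trailer, so measure the size afterwards.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            for (String line : lines) {
                gzip.write(line.getBytes(StandardCharsets.UTF_8));
            }
        }
        return buffer.size();
    }

    /** Builds 20,000 made-up records over 200 keys, in pseudo-random order (fixed seed). */
    public static List<String> sampleRecords() {
        Random random = new Random(7);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            lines.add("user-" + random.nextInt(200) + ",click\n");
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        List<String> unsorted = sampleRecords();
        List<String> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted); // groups identical records next to each other

        System.out.println("unsorted gzip bytes: " + gzipSize(unsorted));
        System.out.println("sorted gzip bytes:   " + gzipSize(sorted));
    }
}
```

On this sample the sorted copy comes out smaller, because identical records sit next to each other and the compressor can encode them as long repeats.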



Answer 3:

The "output compression" settings compress your final output. To compress map output only, use something like this:

  conf.set("mapred.compress.map.output", "true");
  conf.set("mapred.output.compression.type", "BLOCK");
  conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec");


Answer 4:

  1. You need to set "mapred.compress.map.output" to true.
  2. Optionally, you can choose your compression codec by setting "mapred.map.output.compression.codec".

  NOTE1: mapred output compression should never be BLOCK. See the following JIRA for details: https://issues.apache.org/jira/browse/HADOOP-1194
  NOTE2: GZIP and BZ2 are CPU intensive. If you have a slow network and GZIP or BZ2 gives a better compression ratio, that may justify spending the CPU cycles. Otherwise, consider the LZO or Snappy codec.
  NOTE3: If you want to use map output compression, consider installing the native codec, which is invoked via JNI and gives you better performance.
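
The CPU-versus-ratio trade-off in NOTE2 can be tried out locally. Snappy and LZO are not in the JDK, so this sketch substitutes `java.util.zip.Deflater` at its fastest and strongest levels as a rough stand-in for "cheap codec" versus "expensive codec". The sample data and class name are made up, and timings from a single run are only indicative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

public class CodecTradeoffDemo {
    /** Compresses input at the given Deflater level and returns the compressed size in bytes. */
    public static int deflatedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] chunk = new byte[8192];
        int total = 0;
        // Drain the compressor; only the byte count matters here, not the bytes.
        while (!deflater.finished()) {
            total += deflater.deflate(chunk);
        }
        deflater.end();
        return total;
    }

    /** Builds deterministic, repetitive text as a made-up stand-in for map output. */
    public static byte[] sampleData() {
        String[] words = {"hadoop", "mapper", "reducer", "shuffle", "codec", "block", "record", "split"};
        Random random = new Random(42);
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            text.append(words[random.nextInt(words.length)]).append(i % 10 == 9 ? '\n' : ' ');
        }
        return text.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            int size = deflatedSize(data, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level %d: %d -> %d bytes in %d us%n", level, data.length, size, micros);
        }
    }
}
```

Whether the extra CPU time pays off depends on how slow the network between mappers and reducers is, which is the judgment NOTE2 asks you to make.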


Answer 5:

If you use MapR's distribution for Hadoop, you can get the benefits of compression without all the folderol with the codecs.

MapR compresses natively at the file system level so that the application doesn't need to know or care. Compression can be turned on or off at the directory level so you can compress inputs, but not outputs or whatever you like. Generally, the compression is so fast (it uses an algorithm similar to snappy by default) that most applications see a performance boost when using native compression. If your files are already compressed, that is detected very quickly and compression is turned off automatically so you don't see a penalty there, either.