According to this Cloudera post, Snappy IS splittable.
For MapReduce, if you need your compressed data to be splittable, BZip2, LZO, and Snappy formats are splittable, but GZip is not. Splittability is not relevant to HBase data.
But according to Hadoop: The Definitive Guide, Snappy is NOT splittable.
There is also some conflicting information on the web. Some say it's splittable, some say it's not.
Both are correct, but at different levels.
According to the Cloudera blog (http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/), Snappy is intended to be used with a container format, like SequenceFiles or Avro data files, rather than directly on plain text.
This means that if a whole text file is compressed with Snappy, then the file is NOT splittable. But if each record inside the file is compressed with Snappy, then the file can be splittable, for example in SequenceFiles with block compression.
To be clearer: a file compressed as one single Snappy stream is not the same as a file made of many independently Snappy-compressed blocks. Snappy blocks are NOT splittable, but files made of Snappy blocks ARE splittable (see the sketch below).
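For example, here is a minimal sketch (the path and record contents are made up) of writing a block-compressed SequenceFile with Snappy, which is the splittable layout described above:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, SequenceFile, Text}
    import org.apache.hadoop.io.SequenceFile.CompressionType
    import org.apache.hadoop.io.compress.SnappyCodec

    val conf = new Configuration()
    // BLOCK compression: batches of records are Snappy-compressed
    // independently, so the container stays splittable at its sync points.
    val writer = SequenceFile.createWriter(
      conf,
      SequenceFile.Writer.file(new Path("/tmp/records.seq")), // illustrative path
      SequenceFile.Writer.keyClass(classOf[LongWritable]),
      SequenceFile.Writer.valueClass(classOf[Text]),
      SequenceFile.Writer.compression(CompressionType.BLOCK, new SnappyCodec()))
    try {
      for (i <- 0L until 1000L) {
        writer.append(new LongWritable(i), new Text(s"record-$i"))
      }
    } finally {
      writer.close()
    }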
All splittable codecs in Hadoop must implement org.apache.hadoop.io.compress.SplittableCompressionCodec. Looking at the Hadoop source code as of 2.7, we see that org.apache.hadoop.io.compress.SnappyCodec does not implement this interface, so we know it is not splittable.
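If you want to check this from code rather than by reading the source, a quick reflection test (a minimal sketch; both codec classes are real Hadoop classes) shows the difference against BZip2, which does implement the interface:

    import org.apache.hadoop.io.compress.{BZip2Codec, SnappyCodec, SplittableCompressionCodec}

    // BZip2Codec implements SplittableCompressionCodec; SnappyCodec does not.
    println(classOf[SplittableCompressionCodec].isAssignableFrom(classOf[BZip2Codec]))  // true
    println(classOf[SplittableCompressionCodec].isAssignableFrom(classOf[SnappyCodec])) // false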
I have just tested with Spark 1.6.2 on HDFS, with the same number of workers/processors, comparing a simple JSON file against the same file compressed with Snappy.
The Snappy file is created like this (rdd here is an assumed name for the RDD of JSON lines):

    rdd.saveAsTextFile("/user/qwant/benchmark_file_format/json_snappy",
      classOf[org.apache.hadoop.io.compress.SnappyCodec])
So Snappy is not splittable with Spark for JSON.
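One way to see this (a sketch, assuming the output path above, a matching uncompressed input path, and a SparkContext named sc) is to compare how many input partitions Spark creates when reading each file back:

    // assumed paths: the plain JSON input and the Snappy output from above
    val plain  = sc.textFile("/user/qwant/benchmark_file_format/json")
    val snappy = sc.textFile("/user/qwant/benchmark_file_format/json_snappy")
    println(plain.partitions.length)  // several splits for a large plain file
    println(snappy.partitions.length) // one partition per .snappy file: not split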
But if you use the Parquet (or ORC) file format instead of JSON, the files will be splittable (even with gzip).
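As a sketch of that (assuming a Spark 1.6 SQLContext named sqlContext and the JSON path used above), you can rewrite the same data as gzip-compressed Parquet and it remains splittable:

    // Parquet compresses column chunks/pages inside the file, so readers
    // can still split it across tasks even when the codec is gzip.
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
    val df = sqlContext.read.json("/user/qwant/benchmark_file_format/json")
    df.write.parquet("/user/qwant/benchmark_file_format/parquet_gzip")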