According to this Cloudera post, Snappy IS splittable.
For MapReduce, if you need your compressed data to be splittable, BZip2, LZO, and Snappy formats are splittable, but GZip is not. Splittability is not relevant to HBase data.
But according to Hadoop: The Definitive Guide, Snappy is NOT splittable.
There is also some conflicting information on the web. Some say it's splittable, some say it's not.
Both are correct, but at different levels.
According to the Cloudera blog (http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/), Snappy is intended to be used with a container format, like SequenceFiles or Avro data files, rather than directly on plain text.
This means that if a whole text file is compressed with Snappy, then the file is NOT splittable. But if each record inside the file is compressed with Snappy, then the file can be splittable, for example in SequenceFiles with block compression.
To be clearer: a file compressed as one single Snappy stream is not the same as a file made of many independently Snappy-compressed blocks. Snappy blocks are NOT splittable, but files made of Snappy blocks ARE splittable (see the sketch below).
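For example, here is a minimal sketch (the path and record contents are made up) of writing a block-compressed SequenceFile with Snappy, which is the splittable layout described above:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, SequenceFile, Text}
    import org.apache.hadoop.io.SequenceFile.CompressionType
    import org.apache.hadoop.io.compress.SnappyCodec

    val conf = new Configuration()
    // BLOCK compression: batches of records are Snappy-compressed
    // independently, so the container stays splittable at its sync points.
    val writer = SequenceFile.createWriter(
      conf,
      SequenceFile.Writer.file(new Path("/tmp/records.seq")), // illustrative path
      SequenceFile.Writer.keyClass(classOf[LongWritable]),
      SequenceFile.Writer.valueClass(classOf[Text]),
      SequenceFile.Writer.compression(CompressionType.BLOCK, new SnappyCodec()))
    try {
      for (i <- 0L until 1000L) {
        writer.append(new LongWritable(i), new Text(s"record-$i"))
      }
    } finally {
      writer.close()
    }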
All splittable codecs in Hadoop must implement org.apache.hadoop.io.compress.SplittableCompressionCodec. Looking at the Hadoop source code as of 2.7, we see that org.apache.hadoop.io.compress.SnappyCodec does not implement this interface, so we know it is not splittable.
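If you want to check this from code rather than by reading the source, a quick reflection test (a minimal sketch; both codec classes are real Hadoop classes) shows the difference against BZip2, which does implement the interface:

    import org.apache.hadoop.io.compress.{BZip2Codec, SnappyCodec, SplittableCompressionCodec}

    // BZip2Codec implements SplittableCompressionCodec; SnappyCodec does not.
    println(classOf[SplittableCompressionCodec].isAssignableFrom(classOf[BZip2Codec]))  // true
    println(classOf[SplittableCompressionCodec].isAssignableFrom(classOf[SnappyCodec])) // false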
I have just tested with Spark 1.6.2 on HDFS, with the same number of workers/processors, comparing a simple JSON file against the same file compressed with Snappy.
The Snappy file is created like this (rdd here is an assumed name for the RDD of JSON lines):

    rdd.saveAsTextFile("/user/qwant/benchmark_file_format/json_snappy",
      classOf[org.apache.hadoop.io.compress.SnappyCodec])
So Snappy is not splittable with Spark for JSON.
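One way to see this (a sketch, assuming the output path above, a matching uncompressed input path, and a SparkContext named sc) is to compare how many input partitions Spark creates when reading each file back:

    // assumed paths: the plain JSON input and the Snappy output from above
    val plain  = sc.textFile("/user/qwant/benchmark_file_format/json")
    val snappy = sc.textFile("/user/qwant/benchmark_file_format/json_snappy")
    println(plain.partitions.length)  // several splits for a large plain file
    println(snappy.partitions.length) // one partition per .snappy file: not split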
But if you use the Parquet (or ORC) file format instead of JSON, the files will be splittable (even with gzip).
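As a sketch of that (assuming a Spark 1.6 SQLContext named sqlContext and the JSON path used above), you can rewrite the same data as gzip-compressed Parquet and it remains splittable:

    // Parquet compresses column chunks/pages inside the file, so readers
    // can still split it across tasks even when the codec is gzip.
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
    val df = sqlContext.read.json("/user/qwant/benchmark_file_format/json")
    df.write.parquet("/user/qwant/benchmark_file_format/parquet_gzip")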