I am using Spark 1.6 (Cloudera 5.8.2) and tried the methods below to configure the ORC properties, but they do not affect the output.
Below is the code snippet I tried.
import java.util.HashMap;

DataFrame dataframe = hiveContext.createDataFrame(rowData, schema);
dataframe.write().format("orc").options(new HashMap<String, String>() {
    {
        put("orc.compress", "SNAPPY");
        put("hive.exec.orc.default.compress", "SNAPPY");
        put("orc.compress.size", "524288");
        put("hive.exec.orc.default.buffer.size", "524288");
        put("hive.exec.orc.compression.strategy", "COMPRESSION");
    }
}).save("spark_orc_output");
Apart from this, I also tried setting these properties in hive-site.xml and on the hiveContext object.
Running hive --orcfiledump on the output confirms that the configuration was not applied. An orcfiledump snippet is below.
Compression: ZLIB
Compression size: 262144
You are making two different errors here. I don't blame you; I've been there...
Issue #1
orc.compress and the rest are not Spark DataFrameWriter options. They are Hive configuration properties, which must be defined before creating the hiveContext object...
- either in the hive-site.xml available to Spark at launch time
- or in your code, by re-creating the SparkContext...
sc.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
sc.stop
val scAlt = new org.apache.spark.SparkContext((new org.apache.spark.SparkConf).set("orc.compress","snappy"))
scAlt.getConf.get("orc.compress","<undefined>") // will now be Snappy
val hiveContextAlt = new org.apache.spark.sql.hive.HiveContext(scAlt) // HiveContext, since the ORC data source requires Hive support in 1.x
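For the hive-site.xml route, a minimal sketch of the Hive-side defaults (property names taken from the question; note Issue #2 below, Spark's own ORC writer may ignore them anyway):

<!-- hive-site.xml (sketch): must be on Spark's classpath at launch time -->
<property>
  <name>hive.exec.orc.default.compress</name>
  <value>SNAPPY</value>
</property>
<property>
  <name>hive.exec.orc.default.buffer.size</name>
  <value>524288</value>
</property>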
[Edit] with Spark 2.x the script would become...
spark.sparkContext.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
spark.close
val sparkAlt = org.apache.spark.sql.SparkSession.builder().config("orc.compress","snappy").getOrCreate()
sparkAlt.sparkContext.getConf.get("orc.compress","<undefined>") // will now be Snappy
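And if you also need Hive support in 2.x (the HiveContext equivalent), a sketch, assuming Spark was built with Hive support:

val sparkAlt = org.apache.spark.sql.SparkSession.builder()
  .config("orc.compress","snappy")
  .enableHiveSupport()  // requires the Hive classes on the classpath
  .getOrCreate()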
Issue #2
Spark uses its own SerDe libraries for ORC (and Parquet, JSON, CSV, etc.), so it does not have to honor the standard Hadoop/Hive properties.
There are some Spark-specific properties for Parquet, and they are well documented. But again, these properties must be set before creating (or re-creating) the hiveContext.
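For instance, spark.sql.parquet.compression.codec is one of those documented Parquet properties; a minimal sketch following the same re-creation pattern as above (scParquet and hiveContextParquet are illustrative names):

val scParquet = new org.apache.spark.SparkContext(
  (new org.apache.spark.SparkConf).set("spark.sql.parquet.compression.codec","snappy"))
val hiveContextParquet = new org.apache.spark.sql.hive.HiveContext(scParquet)
// Parquet writes through hiveContextParquet should now default to Snappy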
For ORC and the other formats, you have to resort to format-specific DataFrameWriter options; quoting the latest JavaDoc...
You can set the following ORC-specific option(s) for writing ORC files:
• compression (default snappy): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, and lzo). This will override orc.compress
Note that the default compression codec has changed with Spark 2; before that, it was zlib.
So the only thing you can set is the compression codec, using
dataframe.write().format("orc").option("compression","snappy").save("wtf")
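Applied to the snippet from the question, that boils down to something like this sketch (rowData and schema as in the question; the orc.compress and hive.exec.orc.* entries can simply be dropped):

val dataframe = hiveContext.createDataFrame(rowData, schema)
dataframe.write
  .format("orc")
  .option("compression","snappy")  // the only ORC-specific option the writer honors
  .save("spark_orc_output")
// verify with: hive --orcfiledump <output file>  (expect "Compression: SNAPPY")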