Unable to configure ORC properties in Spark

Posted 2019-02-11 03:11

Question:

I am using Spark 1.6 (Cloudera 5.8.2) and tried the methods below to configure ORC properties, but they have no effect on the output.

Below is the code snippet I tried.

// Requires: java.util.HashMap, org.apache.spark.sql.DataFrame
DataFrame dataframe = hiveContext.createDataFrame(rowData, schema);
dataframe.write().format("orc").options(new HashMap<String, String>() {
    {
        // Compression codec
        put("orc.compress", "SNAPPY");
        put("hive.exec.orc.default.compress", "SNAPPY");

        // Compression buffer size (bytes)
        put("orc.compress.size", "524288");
        put("hive.exec.orc.default.buffer.size", "524288");

        // Favor compression ratio over speed
        put("hive.exec.orc.compression.strategy", "COMPRESSION");
    }
}).save("spark_orc_output");

Apart from this, I also tried setting these properties in hive-site.xml and on the hiveContext object.

Running hive --orcfiledump on the output confirms that the configuration was not applied. The relevant part of the dump is below.

Compression: ZLIB
Compression size: 262144
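
When this check has to be repeated after each configuration attempt, the two lines of interest can be pulled out of the dump text programmatically. The following is only a minimal sketch, assuming the dump has already been captured as a string and that the fields appear as "label: value" lines as in the snippet above. The class and method names (OrcDumpCheck, readField) are hypothetical and are not part of Hive or Spark.

```java
import java.util.Optional;

public class OrcDumpCheck {
    // Returns the value after "<label>:" on the first line that starts with it.
    // Assumption: the dump prints one "label: value" pair per line.
    public static Optional<String> readField(String dump, String label) {
        String prefix = label + ":";
        for (String line : dump.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(prefix)) {
                return Optional.of(trimmed.substring(prefix.length()).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Sample text copied from the orcfiledump snippet in the question.
        String dump = "Compression: ZLIB\nCompression size: 262144\n";
        System.out.println(readField(dump, "Compression").orElse("<missing>"));
        System.out.println(readField(dump, "Compression size").orElse("<missing>"));
    }
}
```

Because the prefix includes the trailing colon, looking up "Compression" does not accidentally match the "Compression size" line.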

Answer 1:

You are making two different errors here. I don't blame you; I've been there...

Issue #1
orc.compress and the rest are not Spark DataFrameWriter options. They are Hive configuration properties, which must be defined before creating the hiveContext object...

  • either in the hive-site.xml available to Spark at launch time
  • or in your code, by re-creating the SparkContext...

 sc.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
 sc.stop
 // re-create the SparkContext with the property set up front
 val scAlt = new org.apache.spark.SparkContext((new org.apache.spark.SparkConf).set("orc.compress","snappy"))
 scAlt.getConf.get("orc.compress","<undefined>") // now returns "snappy"
 // a HiveContext (not a plain SQLContext) is needed for ORC support in Spark 1.x
 val hiveContextAlt = new org.apache.spark.sql.hive.HiveContext(scAlt)

[Edit] with Spark 2.x the script would become...
 spark.sparkContext.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
 spark.close
 val sparkAlt = org.apache.spark.sql.SparkSession.builder().config("orc.compress","snappy").getOrCreate()
 sparkAlt.sparkContext.getConf.get("orc.compress","<undefined>") // now returns "snappy"

Issue #2
Spark uses its own SerDe libraries for ORC (and Parquet, JSON, CSV, etc.), so it does not have to honor the standard Hadoop/Hive properties.

There are some Spark-specific properties for Parquet, and they are well documented. But again, these properties must be set before creating (or re-creating) the hiveContext.

For ORC and the other formats, you have to resort to format-specific DataFrameWriter options; quoting the latest JavaDoc...

You can set the following ORC-specific option(s) for writing ORC files:
compression (default snappy): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, and lzo). This will override orc.compress

Note that the default compression codec changed with Spark 2; before that it was zlib.

So the only thing you can set is the compression codec, using

dataframe.write().format("orc").option("compression","snappy").save("wtf")
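
Since the quoted JavaDoc lists only four codec names and treats them as case-insensitive, a caller could normalize the name before handing it to the writer. This is a rough sketch under that assumption: the list of names is copied from the quote above rather than queried from Spark, and the class and method names (OrcCodecNames, normalize) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class OrcCodecNames {
    // Names taken from the JavaDoc quoted above; not read from Spark itself.
    private static final Set<String> KNOWN =
            new HashSet<>(Arrays.asList("none", "snappy", "zlib", "lzo"));

    // Trims and lower-cases the name, rejecting anything outside the quoted list.
    public static String normalize(String codec) {
        String name = codec.trim().toLowerCase(Locale.ROOT);
        if (!KNOWN.contains(name)) {
            throw new IllegalArgumentException("Unknown ORC codec: " + codec);
        }
        return name;
    }

    public static void main(String[] args) {
        // The question used the upper-case spelling "SNAPPY".
        System.out.println(normalize("SNAPPY"));
    }
}
```

The returned string could then be passed as the value of the compression option, for example option("compression", OrcCodecNames.normalize("SNAPPY")).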