Spark 1.6.0, Hive 1.1.0-cdh5.8.0
I have problems saving my DataFrame into a Parquet-backed, partitioned Hive table from Spark.
Here is my code:
import org.apache.spark.sql.SaveMode

val df = sqlContext.createDataFrame(rowRDD, schema)
df.write
  .mode(SaveMode.Append)
  .format("parquet")
  .partitionBy("year")
  .saveAsTable(output)
Nothing special, actually, but I can't read any data from the table once it has been generated.
The key point is the partitioning - without it everything works fine. Here are the steps I took trying to fix the problem:
At first, on a simple SELECT, Hive reports that the table is not partitioned. OK, it seems Spark forgot to mention the partitioning scheme in the DDL. I fixed it by creating the table manually, roughly as sketched below.
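The DDL looked roughly like this (run through a HiveContext; the database, columns, and location here are placeholders, not my real schema):

// Sketch of the manual table creation; every name, column, and the location
// below is a placeholder standing in for my real schema.
sqlContext.sql(
  """CREATE EXTERNAL TABLE my_db.my_table (
    |  id BIGINT,
    |  value STRING
    |)
    |PARTITIONED BY (year INT)
    |LOCATION 'hdfs:///user/hive/warehouse/my_db.db/my_table'""".stripMargin)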
Attempt #2 - still nothing. What is actually going on is that the Hive metastore doesn't know the table has any partitions in the warehouse. Fixed it with: hive> msck repair table
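From Spark the equivalent would be something like this (placeholder table name, partition value, and location; I believe HiveContext simply passes these statements through to Hive):

// Register the missing partitions in the metastore; table name, partition
// value, and location below are placeholders.
sqlContext.sql("MSCK REPAIR TABLE my_db.my_table")
// or add a single partition explicitly:
sqlContext.sql(
  "ALTER TABLE my_db.my_table ADD IF NOT EXISTS PARTITION (year=2016) " +
  "LOCATION 'hdfs:///user/hive/warehouse/my_db.db/my_table/year=2016'")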
Attempt #3 - nope, and now Hive blows up with an exception, something like: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException. OK, so Spark picked the wrong serializer. Fixed it by setting
STORED AS PARQUET
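In other words, I dropped the table and recreated it with the storage format spelled out (same placeholder DDL as above):

// Same placeholder DDL as before, now with the Parquet storage clause so that
// Hive picks the Parquet SerDe.
sqlContext.sql(
  """CREATE EXTERNAL TABLE my_db.my_table (
    |  id BIGINT,
    |  value STRING
    |)
    |PARTITIONED BY (year INT)
    |STORED AS PARQUET
    |LOCATION 'hdfs:///user/hive/warehouse/my_db.db/my_table'""".stripMargin)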
Nope. I don't remember which exception it was, but I realized that Spark had replaced my schema with a single column:
col array COMMENT 'from deserializer'
I replaced it with the correct one - and yet another problem came up.
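For reference, this is how I was checking what schema Spark had actually registered (placeholder table name again); after seeing the single col column I dropped the table and recreated it by hand:

// Inspect the schema the metastore holds for the table (placeholder name);
// this is where the single "col array ... from deserializer" column shows up.
sqlContext.sql("DESCRIBE my_db.my_table").show(100, false)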
And this is where I gave up. To me it looks like Spark generates completely wrong DDL when it tries to create a non-existent table in Hive. But everything works fine as soon as I remove the partitionBy statement.
So where am I going wrong? Or is there, perhaps, a quick fix for this problem?