Note that this question is not a duplicate of this one! I do not use Spark SQL for partitioning! I am saving individual parquet files!
I am also using Databricks which does not support Hive-flavored SQL.
I have a table in Hive (I'm using Databricks) which contains two partitions. It looks like this:
CREATE TABLE foo_test (`col0` STRING, `col1` STRING, `col2` STRING, `datestamp` STRING)
USING parquet
OPTIONS (
`serialization.format` '1'
)
PARTITIONED BY (datestamp)
Edit: this is also the result of calling SHOW CREATE TABLE foo_test;
I've manually added two partitions to this table using Spark:
# read one day's CSV and write it as a single parquet file into that day's partition directory
df = spark.read.csv(file_path.format(datestamp), header=True, inferSchema=True)
partitionPath = "/mnt/foo_test/datestamp={}/".format(datestamp)
df.coalesce(1).write.parquet(partitionPath, mode="overwrite")
This produces the following partition directories:
/mnt/foo_test/datestamp=20180101/
/mnt/foo_test/datestamp=20180102/
If I load the data with Spark I can see it is there:
spark.read.option("mergeSchema", True).parquet("/mnt/foo_test").show()
+----+----+----+----+---------+
|col0|col1|col2|col3|datestamp|
+----+----+----+----+---------+
| foo| bar| baz| 1| 20180102|
| xul| qux| wom| 2| 20180102|
| bar| foo| baz| 1| 20180102|
| qux| xul| wom| 2| 20180102|
| foo| bar| baz|null| 20180101|
| xul| qux| wom|null| 20180101|
| bar| foo| baz|null| 20180101|
| qux| xul| wom|null| 20180101|
+----+----+----+----+---------+
My problem is that if I run an SQL query against this Hive table it returns nothing:
SELECT * FROM foo_test;
OK
Even after adding the partition by hand:
spark.sql("ALTER TABLE foo_test ADD IF NOT EXISTS PARTITION (datestamp=20180102)")
and repairing the table:
MSCK REPAIR TABLE foo_test;
I can see that the partitions are present according to Hive:
SHOW PARTITIONS foo_test;
partition
datestamp=20180102
datestamp=20180101
but the SELECT still returns nothing.
This is the description of my table:
col0 string null
col1 string null
col2 string null
datestamp string null
# Partition Information
# col_name data_type comment
datestamp string null
# Detailed Table Information
Database default
Table foo_test
Owner root
Created Thu Apr 26 12:25:06 UTC 2018
Last Access Thu Jan 01 00:00:00 UTC 1970
Type MANAGED
Provider parquet
Table Properties [transient_lastDdlTime=1524745506]
Location dbfs:/user/hive/warehouse/foo_test
Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties [serialization.format=1]
Partition Provider Catalog
What could be the problem here?
I have to disagree. This is not a Hive table definition; it is a Spark table definition. A Hive table definition would look different.
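As a rough sketch (the answer's original snippet is not reproduced here; the column names are taken from the question), Hive-flavored DDL for this table would be something like:
CREATE TABLE foo_test (`col0` STRING, `col1` STRING, `col2` STRING)
PARTITIONED BY (`datestamp` STRING)
STORED AS PARQUET;
Note STORED AS PARQUET and the partition column declared only in PARTITIONED BY, in contrast to the USING parquet definition shown above.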
So you do use Spark partitioning, and as already explained in the question you've linked (and further in the linked JIRA ticket), Spark and Hive partitioning schemes are not compatible.
Note that both Hive and Spark tables are supported in SparkSession.sql, as long as Hive support has been enabled when the SparkSession was initialized (the default on the Databricks platform).
Also, it is not clear why you write to /mnt/foo_test/datestamp={} here, which might be another source of problems. If you want to use the local file API (why would you?), Databricks mounts it by default on /dbfs.
Since you call ADD PARTITIONS without a location, it uses the root path for the table (dbfs:/user/hive/warehouse/foo_test, based on the DESCRIBE output), so if you decided to use the local API with the default configuration, you should write under that location. If you use a non-standard configuration, it would be great if you included it in your question.
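For illustration only, a sketch of what writing a partition under that table root could look like (the path below is assembled from the DESCRIBE output and the variables used in the question, not taken from the answer):
df = spark.read.csv(file_path.format(datestamp), header=True, inferSchema=True)
# write directly under the table's root so ADD PARTITION without a LOCATION can find the data
df.coalesce(1).write.parquet(
    "dbfs:/user/hive/warehouse/foo_test/datestamp={}".format(datestamp),
    mode="overwrite"
)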
The table definition is pointing to a different location. It should point to the following location:
/mnt/foo_test
Look at Hive's CREATE TABLE with a LOCATION clause. When you create your table in Hive, use a CREATE TABLE with LOCATION.
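A hedged sketch of such a statement (the exact DDL was not included in this answer; EXTERNAL and the column list are assumptions based on the question):
CREATE EXTERNAL TABLE foo_test (`col0` STRING, `col1` STRING, `col2` STRING)
PARTITIONED BY (`datestamp` STRING)
STORED AS PARQUET
LOCATION '/mnt/foo_test';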
You can then write with Spark using a partitioned write.
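Again only a sketch (the answer's code is not included; adding the datestamp column via withColumn is an assumption, since partitionBy needs that column to exist in the DataFrame):
from pyspark.sql import functions as F

df = spark.read.csv(file_path.format(datestamp), header=True, inferSchema=True)
# partitionBy requires the partition column to be present in the DataFrame itself
df = df.withColumn("datestamp", F.lit(datestamp))
df.write.partitionBy("datestamp").mode("append").parquet("/mnt/foo_test")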
This will write to the Hive table with the partitioned path /mnt/foo_test/datestamp=***/.
I hope this helps.
You are not setting the location in your CREATE TABLE definition, and you're not setting the location of the newly added partition.
In your table definition, you should define it as an external table and give it the path to your data, either as the LOCATION or PATH parameter. Then MSCK REPAIR TABLE should properly add the partitions.
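As a hedged sketch in the Spark-style DDL the question already uses (the exact statement is not part of this answer; the path is the question's mount point, and OPTIONS (path '...') would be the equivalent of the PATH parameter):
CREATE TABLE foo_test (`col0` STRING, `col1` STRING, `col2` STRING, `datestamp` STRING)
USING parquet
PARTITIONED BY (datestamp)
LOCATION '/mnt/foo_test'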
For the ALTER TABLE command, you need to set the LOCATION parameter. In that snippet you are just telling the table "there's a partition datestamp=20180102" without telling it where the data is. Per https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions:
spark.sql("ALTER TABLE foo_test ADD IF NOT EXISTS PARTITION (datestamp=20180102) location '/mnt/foo_test/datestamp=20180102/' ")