Data from partitioned table does not show up when

Posted 2019-06-14 18:31

Note that this question is not a duplicate of this one! I do not use Spark SQL for partitioning! I am saving individual parquet files!

I am also using Databricks which does not support Hive-flavored SQL.

I have a table in Hive (I'm using Databricks) which contains two partitions. It looks like this:

CREATE TABLE foo_test (`col0` STRING, `col1` STRING, `col2` STRING, `datestamp` STRING)
USING parquet
OPTIONS (
  `serialization.format` '1'
)
PARTITIONED BY (datestamp)

Edit: this is also the output of running SHOW CREATE TABLE foo_test;

I've added two partitions with Spark to this table by hand:

df = spark.read.csv(file_path.format(datestamp), header=True, inferSchema=True)

partitionPath = "/mnt/foo_test/datestamp={}/".format(datestamp)

df.coalesce(1).write.parquet(partitionPath, mode="overwrite")

This produces the following directories:

/mnt/foo_test/datestamp=20180101/
/mnt/foo_test/datestamp=20180102/

If I load the data with Spark I can see it is there:

spark.read.option("mergeSchema", True).parquet("/mnt/foo_test").show()

+----+----+----+----+---------+
|col0|col1|col2|col3|datestamp|
+----+----+----+----+---------+
| foo| bar| baz|   1| 20180102|
| xul| qux| wom|   2| 20180102|
| bar| foo| baz|   1| 20180102|
| qux| xul| wom|   2| 20180102|
| foo| bar| baz|null| 20180101|
| xul| qux| wom|null| 20180101|
| bar| foo| baz|null| 20180101|
| qux| xul| wom|null| 20180101|
+----+----+----+----+---------+

My problem is that if I run an SQL query against this Hive table it returns nothing:

SELECT * FROM foo_test;

OK

Even after adding the partition by hand:

spark.sql("ALTER TABLE foo_test ADD IF NOT EXISTS PARTITION (datestamp=20180102)")

and repairing the table:

MSCK REPAIR TABLE foo_test;

I can see that the partitions are present according to Hive:

SHOW PARTITIONS foo_test;

partition
datestamp=20180102
datestamp=20180101

but the SELECT returns nothing.

This is the description of my table:

col0    string  null
col1    string  null
col2    string  null
datestamp   string  null
# Partition Information     
# col_name  data_type   comment
datestamp   string  null
# Detailed Table Information        
Database    default 
Table   foo_test    
Owner   root    
Created Thu Apr 26 12:25:06 UTC 2018    
Last Access Thu Jan 01 00:00:00 UTC 1970    
Type    MANAGED 
Provider    parquet 
Table Properties    [transient_lastDdlTime=1524745506]  
Location    dbfs:/user/hive/warehouse/foo_test  
Serde Library   org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe 
InputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat   
OutputFormat    org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat  
Storage Properties  [serialization.format=1]    
Partition Provider  Catalog

What could be the problem here?

4 Answers
Answer from 唯我独甜 — 2019-06-14 18:53

I do not use Spark SQL for partitioning!

I have to disagree. This is not a Hive table definition:

CREATE TABLE foo_test (`col0` STRING, `col1` STRING, `col2` STRING, `datestamp` STRING)
USING parquet
OPTIONS (
  `serialization.format` '1'
)
PARTITIONED BY (datestamp)

It is a Spark table definition.

Hive table definition would be:

CREATE TABLE foo_test (`col0` STRING, `col1` STRING, `col2` STRING)
PARTITIONED BY (`datestamp` STRING)
STORED AS PARQUET

So you do use Spark partitioning, and, as already explained in the question you've linked (and further in the linked JIRA ticket), Spark and Hive partitioning schemes are not compatible.

Note that SparkSession.sql supports both Hive and Spark table definitions, as long as Hive support was enabled when the SparkSession was initialized (which is the default on the Databricks platform).
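
For reference, a minimal sketch of what "Hive support enabled at SparkSession initialization" looks like outside Databricks; the application name is just an illustration:

from pyspark.sql import SparkSession

# Sketch only: Databricks creates a Hive-enabled SparkSession for you;
# in a self-managed setup you would enable it explicitly like this.
spark = (
    SparkSession.builder
    .appName("foo_test")       # hypothetical application name
    .enableHiveSupport()       # lets spark.sql() see tables in the Hive metastore
    .getOrCreate()
)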

Also, it is not clear why you write to /mnt/foo_test/datestamp={} here; that might be another source of problems. If you want to use the local file API (why would you?), Databricks mounts DBFS by default at /dbfs.

Since you call ADD PARTITION without a location, it uses the root path of the table (dbfs:/user/hive/warehouse/foo_test, based on the DESCRIBE output), so if you decide to use the local API with the default configuration, you should write to

/dbfs/user/hive/warehouse/foo_test/datestamp=20180102

If you use a non-standard configuration, it would be great if you included it in your question.
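
For example, a minimal sketch under the assumption that you keep the default warehouse location shown in the DESCRIBE output and write with Spark rather than the local file API:

# Assumed path, taken from the table's Location in the DESCRIBE output.
partition_dir = "dbfs:/user/hive/warehouse/foo_test/datestamp=20180102"

# Write the partition directory under the table root, then register it,
# so the default location used by ADD PARTITION matches where the data lives.
df.coalesce(1).write.parquet(partition_dir, mode="overwrite")
spark.sql("ALTER TABLE foo_test ADD IF NOT EXISTS PARTITION (datestamp=20180102)")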

Answer from 手持菜刀,她持情操 — 2019-06-14 18:59

The table definition is pointing to a different location:

Location    dbfs:/user/hive/warehouse/foo_test  

It should point to the following location: /mnt/foo_test

Look up Hive CREATE TABLE with a LOCATION clause.
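
As a hedged alternative to recreating the table, re-pointing the existing table at the data and re-discovering the partitions might look like this, assuming the files really live under /mnt/foo_test:

# Sketch only: repoint the table at the mounted data, then rescan it.
spark.sql("ALTER TABLE foo_test SET LOCATION 'dbfs:/mnt/foo_test'")
spark.sql("MSCK REPAIR TABLE foo_test")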

Answer from 老娘就宠你 — 2019-06-14 19:02

When you create your table in Hive, use:

CREATE TABLE foo_test (`col0` STRING, `col1` STRING, `col2` STRING)
PARTITIONED BY (`datestamp` STRING)
STORED AS PARQUET
LOCATION '/mnt/foo_test';

Then you can write from Spark with partitioning:

df.repartition("datestamp").write.partitionBy("datestamp").parquet("/mnt/foo_test", mode="overwrite")

This will write to the Hive table location with partitioned paths of the form /mnt/foo_test/datestamp=***/.
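
A sketch of the end-to-end flow, assuming the table was created with LOCATION '/mnt/foo_test' as above; after the files are written, the metastore still needs to pick up the new directories:

# Write every datestamp under the table's location in one go
# (overwrite replaces whatever is already under /mnt/foo_test).
(df.repartition("datestamp")
   .write
   .mode("overwrite")
   .partitionBy("datestamp")
   .parquet("/mnt/foo_test"))

# Register the new datestamp=... directories with the metastore, then check.
spark.sql("MSCK REPAIR TABLE foo_test")
spark.sql("SELECT * FROM foo_test").show()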

I hope this helps.

Answer from 兄弟一词,经得起流年. — 2019-06-14 19:04

You are not setting the location in your CREATE TABLE definition, and you're not setting the location of the newly added partition.

In your table definition, you should define it as an external table and give it the path to your data, either as the LOCATION or PATH parameter. Then MSCK REPAIR TABLE should properly add the partitions.

For the ALTER TABLE command, you need to set the LOCATION parameter. In that snippet you are just telling the table "there's a partition datestamp=20180102" without telling it where the data is.

per https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions

spark.sql("ALTER TABLE foo_test ADD IF NOT EXISTS PARTITION (datestamp=20180102) location '/mnt/foo_test/datestamp=20180102/' ")
