I wrote a DataFrame out as a Parquet file, and I would like to read that file with Hive, using the metadata from Parquet.
Output from the Parquet write:
_common_metadata
_metadata
_SUCCESS
part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00001-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00002-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00003-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
Hive table
CREATE TABLE testhive
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'/home/gz_files/result';
FAILED: SemanticException [Error 10043]: Either list of columns or a custom serializer should be specified
How can I infer the metadata from the Parquet file?
If I open the _common_metadata file, I see the following content:
PAR1LHroot
%TSN%
%TS%
%Etype%
)org.apache.spark.sql.parquet.row.metadata▒{"type":"struct","fields":[{"name":"TSN","type":"string","nullable":true,"metadata":{}},{"name":"TS","type":"string","nullable":true,"metadata":{}},{"name":"Etype","type":"string","nullable":true,"metadata":{}}]}
Or, how can I parse the metadata file?
A small improvement over Victor's answer (adding quotes around field.name), modified to bind the table to a local Parquet file (tested on Spark 1.6.1).
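Roughly, a sketch of the approach, assuming a Spark 1.6 spark-shell where sqlContext is already available (the helper name dataFrameToDDL and the use of DataType.simpleString are mine):

import org.apache.spark.sql.DataFrame

def dataFrameToDDL(dataFrame: DataFrame, tableName: String, absFilePath: String): String = {
  val columns = dataFrame.schema.map { field =>
    // Back-quote the field name so reserved words and odd characters survive in Hive.
    "  `" + field.name + "` " + field.dataType.simpleString.toUpperCase
  }
  s"CREATE EXTERNAL TABLE $tableName (\n${columns.mkString(",\n")}\n) STORED AS PARQUET LOCATION '$absFilePath'"
}

// Example, using the result directory from the question:
val df = sqlContext.read.parquet("/home/gz_files/result")
println(dataFrameToDDL(df, "testhive", "/home/gz_files/result"))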
Also notice that Impala actually supports deriving the table definition straight from a Parquet data file, with no columns section at all (CREATE TABLE ... LIKE PARQUET):
http://www.cloudera.com/content/www/en-us/documentation/archive/impala/2-x/2-1-x/topics/impala_create_table.html
Your question is tagged "hive" and "spark", and I don't see this implemented in Hive, but if you use CDH, it may be what you were looking for.
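For reference, roughly what that looks like in Impala (a sketch; the data-file path is one of the part files from the question and would normally be an HDFS path):

CREATE EXTERNAL TABLE testhive
LIKE PARQUET '/home/gz_files/result/part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet'
STORED AS PARQUET
LOCATION '/home/gz_files/result';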
Here's a solution I've come up with to get the metadata from parquet files in order to create a Hive table.
First, start a spark-shell (or compile it all into a JAR and run it with spark-submit, but the shell is SO much easier).
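Then something along these lines in the shell (a sketch; sqlContext comes pre-built in the shell, and the path is the result directory from the question):

// Load the Parquet files and let Spark read the schema from their metadata.
val df = sqlContext.read.parquet("/home/gz_files/result")
df.printSchema()            // tree view of column names and types
df.dtypes.foreach(println)  // (columnName, typeName) pairs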
It will provide you with the data types that Hive will use for each column, as they are stored in Parquet, e.g.:
CREATE EXTERNAL TABLE test_table (COL1 Decimal(38,10), COL2 String, COL3 Timestamp) STORED AS PARQUET LOCATION '/path/to/parquet/files'
I had the same question. It might be hard to implement from the practical side, though, as Parquet supports schema evolution:
http://www.cloudera.com/content/www/en-us/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html#parquet_schema_evolution_unique_1
For example, you can add a new column to your table without touching the data that's already in the table; only new data files will carry the new metadata (compatible with the previous version).
Schema merging has been switched off by default since Spark 1.5.0, since it is a "relatively expensive operation" (http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging), so inferring the most recent schema may not be as simple as it sounds. Quick-and-dirty approaches are quite possible, though, e.g. by parsing output from
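If you do want the merged schema, you can still request it per read; a sketch against the DataFrameReader API (path as in the question):

// Opt back in to schema merging for this read only.
val merged = sqlContext.read.option("mergeSchema", "true").parquet("/home/gz_files/result")
merged.printSchema()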
I'd just like to expand on James Tobin's answer. There's a StructField class which provides Hive's data types without doing string replacements.
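A sketch of what I mean, using DataType.simpleString to get Hive-friendly type names (the helper name columnsForHive is mine):

import org.apache.spark.sql.DataFrame

// One Hive column definition per field, taken straight from the DataFrame's schema.
def columnsForHive(df: DataFrame): String =
  df.schema.fields.map(field => s"`${field.name}` ${field.dataType.simpleString}").mkString(",\n")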
This solves the IntegerType problem.
This should work with any DataFrame, not just with Parquet. (e.g., I'm using this with a JDBC DataFrame.)
As an added bonus, if your target DDL supports nullable columns, you can extend the function by checking StructField.nullable.
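For instance, a hypothetical extension along the lines of the sketch above, for targets that understand NOT NULL:

import org.apache.spark.sql.types.StructField

// Append NOT NULL when the source column is declared non-nullable.
def columnDef(field: StructField): String =
  s"`${field.name}` ${field.dataType.simpleString}" + (if (field.nullable) "" else " NOT NULL")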