Convert JSON to Parquet

Posted 2019-03-16 21:07

I have a few TB of log data in JSON format, and I want to convert it to Parquet to get better performance in the analytics stage.

I've managed to do this by writing a MapReduce Java job that uses parquet-mr and parquet-avro.

The only thing I'm not satisfied with is that my JSON logs don't have a fixed schema: I don't know all the fields' names and types. Besides, even if I knew all the fields' names and types, my schema would still evolve over time; for example, new fields will be added in the future.

For now I have to provide an Avro schema for AvroWriteSupport, and Avro only allows a fixed set of fields.
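
For illustration, here is a stripped-down, non-MapReduce sketch of this kind of fixed-schema writer (the schema, field names and output path below are made up for the example, not my real job):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

object FixedSchemaWriter {
  def main(args: Array[String]): Unit = {
    // Every field has to be declared up front; a new field in the JSON logs
    // means editing this schema and redeploying the job.
    val schema = new Schema.Parser().parse(
      """{"type": "record", "name": "LogEvent", "fields": [
        |  {"name": "ts",    "type": "long"},
        |  {"name": "level", "type": "string"},
        |  {"name": "msg",   "type": "string"}
        |]}""".stripMargin)

    val writer = AvroParquetWriter
      .builder[GenericRecord](new Path("/tmp/logs.parquet"))
      .withSchema(schema)
      .build()

    val record = new GenericData.Record(schema)
    record.put("ts", System.currentTimeMillis())
    record.put("level", "INFO")
    record.put("msg", "hello")
    writer.write(record)
    writer.close()
  }
}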

Is there a better way to store arbitrary fields in Parquet, just like JSON?

Tags: avro parquet
2 answers
Animai°情兽
Answer 2 · 2019-03-16 21:18

Use Apache Drill!

As shown at https://drill.apache.org/docs/parquet-format/, the conversion takes one line of SQL.

After setting up Apache Drill (with or without HDFS), run sqlline to execute SQL queries:

-- Set the default storage format to Parquet
ALTER SESSION SET `store.format` = 'parquet';
ALTER SYSTEM SET `store.format` = 'parquet';

-- Migrate the data
CREATE TABLE dfs.tmp.sampleparquet AS (
  SELECT trans_id,
         CAST(`date` AS date) transdate,
         CAST(`time` AS time) transtime,
         CAST(amount AS double) amountm,
         user_info, marketing_info, trans_info
  FROM dfs.`/Users/drilluser/sample.json`
);

It may take some time, maybe hours, but at the end you'll have compact Parquet files ;-)

In my test, querying a Parquet file was about 4x faster than querying the JSON and used fewer resources.

狗以群分
Answer 3 · 2019-03-16 21:44

One thing is for sure: Parquet needs a schema in advance (an Avro schema, if you go through parquet-avro). So let's focus on how to get that schema.

  1. Use SparkSQL to convert JSON files to Parquet files.

    SparkSQL can infer a schema automatically from the data, so we don't need to provide one ourselves. Every time the data changes, SparkSQL infers an updated schema (see the first sketch after this list).

  2. Maintain an Avro schema manually.

    If you use plain Hadoop rather than Spark, you need to infer the schema yourself. First write a MapReduce job that scans all the JSON files and collects every field it sees; once you know all the fields, you can write an Avro schema and use it to convert the JSON files to Parquet.

    New, unknown fields will show up in the future; whenever they do, add them to the Avro schema. So basically we're doing SparkSQL's job by hand (see the second sketch after this list).
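
Here is a minimal sketch of approach 1 (the input/output paths and app name are placeholders): let Spark scan the JSON, infer the schema, and rewrite everything as Parquet.

import org.apache.spark.sql.SparkSession

object JsonToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()

    // Spark scans the JSON files and infers a schema covering every field it sees.
    val logs = spark.read.json("hdfs:///logs/json/")
    logs.printSchema()   // inspect the inferred schema

    logs.write.mode("overwrite").parquet("hdfs:///logs/parquet/")
    spark.stop()
  }
}

And for approach 2, a rough sketch of the core of the scan step, written as plain Scala over one local file instead of a full MapReduce job (the file path, the Jackson usage and the type-guessing rule are my own illustration): collect the union of top-level field names so you can then write the Avro schema by hand.

import com.fasterxml.jackson.databind.ObjectMapper
import scala.io.Source
import scala.jdk.CollectionConverters._

object CollectFields {
  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper()
    // field name -> JSON node type of its first occurrence
    val fields = scala.collection.mutable.LinkedHashMap[String, String]()

    for (line <- Source.fromFile(args(0)).getLines() if line.trim.nonEmpty) {
      val node = mapper.readTree(line)
      for (name <- node.fieldNames().asScala) {
        // Type conflicts across records still need manual resolution
        // when you turn this list into an Avro schema.
        fields.getOrElseUpdate(name, node.get(name).getNodeType.toString)
      }
    }
    fields.foreach { case (name, tpe) => println(s"$name -> $tpe") }
  }
}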
