I have some partitioned Hive tables that point to Parquet files. Each partition now contains a lot of small Parquet files, each around 5 KB, and I want to merge them into one large file per partition. How can I achieve this to improve my Hive query performance? I have tried reading all the Parquet files in a partition into a PySpark DataFrame, rewriting the combined DataFrame back to the same partition, and deleting the old files, but that feels inefficient and like a beginner-level approach to me. What are the pros and cons of doing it this way? And if there are better ways, please guide me on how to achieve this in Spark or PySpark.
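Roughly what I am doing now per partition (the paths and partition value here are just placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholder locations; the real table layout differs.
partition_path = "/warehouse/my_table/dt=2023-01-01"
staging_path = "/tmp/my_table_compacted/dt=2023-01-01"

# Read all the small parquet files of one partition into a DataFrame,
# coalesce to a single output file, and write it to a staging location.
df = spark.read.parquet(partition_path)
df.coalesce(1).write.mode("overwrite").parquet(staging_path)

# I then move the compacted file into the partition directory and
# delete the old small files.
```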
You can read the whole dataset, `repartition` it by the columns you partition on, and then write it back using `partitionBy` (this is also how you should save it in the future). Something like:
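A minimal sketch, assuming a hypothetical table path and a partition column named `partition_col` (adjust both to your table):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical input and output paths.
source_path = "/warehouse/my_table"
target_path = "/warehouse/my_table_compacted"

# Read every parquet file under the table in one go.
df = spark.read.parquet(source_path)

# Repartition by the partition column so all rows with the same
# partition value land in the same shuffle partition; partitionBy
# then writes one large parquet file per Hive partition directory.
(df.repartition("partition_col")
   .write
   .mode("overwrite")
   .partitionBy("partition_col")
   .parquet(target_path))
```

Writing to a separate target path and then swapping it in (or repointing the Hive table location) avoids reading from and overwriting the same directory within a single job.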