How to combine small parquet files into one large parquet file

Published 2019-08-18 15:46

Question:

This question already has an answer here:

  • Spark dataframe write method writing many small files (6 answers)

I have some partitioned Hive tables which point to parquet files. Each partition now contains a lot of small parquet files, each around 5 KB in size, and I want to merge them into one large file per partition to improve Hive performance. I have tried reading all the parquet files in a partition into a PySpark dataframe, rewriting the combined dataframe back to the same partition, and deleting the old files, but this feels like an inefficient, beginner-level approach to me. What are the pros and cons of doing it this way? And if there are other ways, please guide me on how to achieve this in Spark or PySpark.
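For reference, here is a minimal sketch of the per-partition rewrite I described, assuming hypothetical table paths and a single partition directory (the real paths and partition columns would differ):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths; substitute the real partition location of your table.
src = '/warehouse/my_table/dt=2019-08-18'
tmp = '/warehouse/my_table_compacted/dt=2019-08-18'

# Read all the small files in one partition.
df = spark.read.parquet(src)

# coalesce(1) collapses the data into a single output file without a full shuffle.
df.coalesce(1).write.mode('overwrite').parquet(tmp)

# Afterwards the compacted output has to be swapped into place (e.g. an HDFS
# rename) and the old small files deleted, since Spark cannot overwrite a path
# it is currently reading from.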

Answer 1:

You can read the whole dataset, repartition it by the partition columns you have, and then write it out using partitionBy (this is also how you should save it in the future). Something like:

# list all of the table's partition columns in both repartition and partitionBy
spark\
    .read\
    .parquet('...')\
    .repartition('key1', 'key2')\
    .write\
    .partitionBy('key1', 'key2')\
    .option('path', target_part)\
    .saveAsTable('partitioned')
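The reason this avoids small files: repartition on the partition columns shuffles all rows with the same key combination into the same task, so when partitionBy then writes one file per partition directory per task, each Hive partition ends up with a single (or very few) parquet file instead of one file per task.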