I have some partitioned Hive tables that point to Parquet files. Each partition now contains a lot of small Parquet files, each around 5 KB, and I want to merge them into one large file per partition. How can I achieve this to improve my Hive query performance? I have tried reading all the Parquet files in a partition into a PySpark DataFrame, rewriting the combined DataFrame back to the same partition, and deleting the old files, but that feels inefficient and like a beginner-level approach to me. What are the pros and cons of doing it this way? And if there are better ways, please guide me on how to achieve this in Spark or PySpark.
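Roughly what I am doing now per partition (the paths and partition value here are just placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholder locations; the real table layout differs.
partition_path = "/warehouse/my_table/dt=2023-01-01"
staging_path = "/tmp/my_table_compacted/dt=2023-01-01"

# Read all the small parquet files of one partition into a DataFrame,
# coalesce to a single output file, and write it to a staging location.
df = spark.read.parquet(partition_path)
df.coalesce(1).write.mode("overwrite").parquet(staging_path)

# I then move the compacted file into the partition directory and
# delete the old small files.
```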
You can read the whole dataset, `repartition` it by the columns you partition on, and then write it back using `partitionBy` (this is also how you should save it in the future). Something like:
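A minimal sketch, assuming a hypothetical table path and a partition column named `partition_col` (adjust both to your table):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical input and output paths.
source_path = "/warehouse/my_table"
target_path = "/warehouse/my_table_compacted"

# Read every parquet file under the table in one go.
df = spark.read.parquet(source_path)

# Repartition by the partition column so all rows with the same
# partition value land in the same shuffle partition; partitionBy
# then writes one large parquet file per Hive partition directory.
(df.repartition("partition_col")
   .write
   .mode("overwrite")
   .partitionBy("partition_col")
   .parquet(target_path))
```

Writing to a separate target path and then swapping it in (or repointing the Hive table location) avoids reading from and overwriting the same directory within a single job.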