I have some partitioned Hive tables that point to Parquet files. Each partition now contains many small Parquet files, each around 5 KB, and I want to merge them into one large file per partition. How can I do this to improve my Hive query performance?

What I have tried so far: reading all the Parquet files of a partition into a PySpark dataframe, rewriting the combined dataframe back to the same partition, and deleting the old files (roughly like the sketch below). This approach feels inefficient, or at least naive, to me. What are the pros and cons of doing it this way? And if there are better ways, please guide me on how to achieve this in Spark or PySpark.
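For reference, this is roughly what I am doing now. It is only a minimal sketch: the database, table, and partition paths (`my_db`, `my_table`, `dt=2023-01-01`) are placeholders for my actual layout.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("compact-small-parquet-files")
         .enableHiveSupport()
         .getOrCreate())

# Placeholder partition directory -- replace with the real HDFS path.
partition_path = "hdfs:///warehouse/my_db.db/my_table/dt=2023-01-01"

# Read every small parquet file in the partition into one dataframe.
df = spark.read.parquet(partition_path)

# coalesce(1) collapses the output to a single file. Writing to a
# temporary location first keeps the original data intact until the
# compacted copy is verified.
tmp_path = partition_path + "_compacted"
df.coalesce(1).write.mode("overwrite").parquet(tmp_path)

# Afterwards I swap the directories (e.g. hdfs dfs -mv), drop the old
# files, and refresh the Hive metastore (MSCK REPAIR TABLE) if needed.
```

This is the read-rewrite-delete cycle described above, just written out, so the question is whether this pattern is reasonable at scale or whether there is a more idiomatic way.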