Hi, I have a couple of Spark jobs which process thousands of files every day. File sizes may vary from MBs to GBs. After the job finishes I usually save the output using the following code:
finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
dataFrame.write.format("orc").save("/path/in/hdfs") //storing as ORC file as of Spark 1.4
The Spark job creates plenty of small part files in the final output directory. As far as I understand, Spark creates one part file per partition/task; please correct me if I am wrong. How do we control the number of part files Spark creates? Finally, I would like to create a Hive table over these Parquet/ORC directories, and I have heard Hive is slow when there is a large number of small files. Please guide me, I am new to Spark. Thanks in advance.
You may want to try using the DataFrame.coalesce method to decrease the number of partitions; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion).
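For example, a minimal sketch (assuming your final DataFrame is named finalDf and a target of 10 output files; both are illustrative placeholders you would tune for your data volume):

```scala
// Coalesce down to a small number of partitions so the write produces
// that many part files instead of one per original task.
val reducedDf = finalDf.coalesce(10)  // 10 is just an example target

// Write the reduced DataFrame as ORC (works the same for Parquet).
reducedDf.write.format("orc").save("/path/in/hdfs")
```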
To increase or decrease the number of partitions you can use the DataFrame.repartition function. Note that coalesce does not cause a shuffle while repartition does. Since 1.6 you can repartition a DataFrame by the partition column, which means you'll get one file per Hive partition. Beware of large shuffles though; it's best to have your DataFrame partitioned properly from the start if possible. See https://stackoverflow.com/a/32920122/2204206
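For instance, a rough sketch (assuming Spark 1.6+, a DataFrame named df, and a partition column dt; the names are illustrative):

```scala
// Repartition by the same column the output is partitioned by, so all rows
// for a given dt value land in one partition before the write.
df.repartition(df("dt"))
  .write
  .format("orc")
  .partitionBy("dt")          // one output directory per dt value
  .save("/path/in/hdfs")      // roughly one file per Hive partition
```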