Hi, I have a couple of Spark jobs which process thousands of files every day. File sizes may vary from MBs to GBs. After the job finishes I usually save the output using the following code:
finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
dataFrame.write.format("orc").save("/path/in/hdfs") //storing as ORC file as of Spark 1.4
The Spark job creates plenty of small part files in the final output directory. As far as I understand, Spark creates one part file per partition/task; please correct me if I am wrong. How do we control the number of part files Spark creates? Finally, I would like to create a Hive table over these Parquet/ORC directories, and I have heard Hive is slow when there is a large number of small files. Please guide me, I am new to Spark. Thanks in advance.
You may want to try using the DataFrame.coalesce method to decrease the number of partitions; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion).
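For example, a minimal sketch (assuming your final DataFrame is named finalDf and a target of 10 output files; both are illustrative placeholders you would tune for your data volume):

```scala
// Coalesce down to a small number of partitions so the write produces
// that many part files instead of one per original task.
val reducedDf = finalDf.coalesce(10)  // 10 is just an example target

// Write the reduced DataFrame as ORC (works the same for Parquet).
reducedDf.write.format("orc").save("/path/in/hdfs")
```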
To increase or decrease the number of partitions you can use the DataFrame.repartition function. Note that coalesce does not cause a shuffle while repartition does. Since 1.6 you can repartition a DataFrame by the partition column, which means you'll get one file per Hive partition. Beware of large shuffles though; it's best to have your DataFrame partitioned properly from the start if possible. See https://stackoverflow.com/a/32920122/2204206
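For instance, a rough sketch (assuming Spark 1.6+, a DataFrame named df, and a partition column dt; the names are illustrative):

```scala
// Repartition by the same column the output is partitioned by, so all rows
// for a given dt value land in one partition before the write.
df.repartition(df("dt"))
  .write
  .format("orc")
  .partitionBy("dt")          // one output directory per dt value
  .save("/path/in/hdfs")      // roughly one file per Hive partition
```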