How many partitions does Spark create when a file is loaded from S3?

Posted 2020-06-04 02:54

If a file is loaded from HDFS, Spark by default creates one partition per HDFS block. But how does Spark decide the number of partitions when a file is loaded from an S3 bucket?

2 Answers
Luminary・发光体
Answer #2 · 2020-06-04 03:10

See the code of org.apache.hadoop.mapred.FileInputFormat.getSplits().

The block size depends on the S3 filesystem implementation (see FileStatus.getBlockSize()). For example, S3AFileStatus just sets it to 0, and then FileInputFormat.computeSplitSize() comes into play.
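For reference, here is the gist of that computation, paraphrased in Scala from the Java method in the Hadoop source (names mirror the original):

// Paraphrase of org.apache.hadoop.mapred.FileInputFormat.computeSplitSize()
def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))

// goalSize  = total input size / number of splits requested
//             (Spark's minPartitions hint feeds into this)
// minSize   = mapreduce.input.fileinputformat.split.minsize (default 1)
// blockSize = FileStatus.getBlockSize() for the file being split

Note that when blockSize is 0, min(goalSize, 0) is 0 and the formula collapses to minSize, so the configured minimum split size is what ends up driving the splits.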

Also, you don't get splits if your InputFormat is not splittable :)
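For example, gzip-compressed text is not splittable, so each .gz file becomes a single partition no matter how large it is. A quick check (the path below is hypothetical):

val gzRDD = sc.textFile("s3a://some-bucket/logs/big.log.gz") // hypothetical path
println(gzRDD.partitions.length) // prints 1: one partition per non-splittable file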

Luminary・发光体
Answer #3 · 2020-06-04 03:14

Spark treats S3 as a block-based filesystem, so the partitioning rules for HDFS and S3 inputs are the same: by default you get one partition per block. It is worth inspecting the number of partitions created yourself:

// Load the file from S3 and check how many partitions Spark created
val inputRDD = sc.textFile("s3a://...")
println(inputRDD.partitions.length)
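If the default is too coarse, textFile also takes a minPartitions hint, which feeds into Hadoop's split-size computation; alternatively, you can raise the minimum split size through the Hadoop configuration. A sketch, with hypothetical paths and sizes:

// Ask Spark to aim for at least 100 input splits (hypothetical path)
val finerRDD = sc.textFile("s3a://some-bucket/data/input.txt", minPartitions = 100)
println(finerRDD.partitions.length)

// Or enforce a 128 MB minimum split size, yielding fewer, larger partitions
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", 128 * 1024 * 1024)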

For further reading I suggest this, which covers partitioning rules in detail.
