If a file is loaded from HDFS, by default Spark creates one partition per block. But how does Spark decide partitions when a file is loaded from an S3 bucket?
See the code of org.apache.hadoop.mapred.FileInputFormat.getSplits(). The block size depends on the S3 file system implementation (see FileStatus.getBlockSize()). E.g. S3AFileStatus just sets it to 0, and then FileInputFormat.computeSplitSize() comes into play. Also, you don't get splits at all if your InputFormat is not splittable :)
Spark will treat S3 as if it were a block-based filesystem, so the partitioning rules for HDFS and S3 inputs are the same: by default you get one partition per block. It is worth inspecting the number of created partitions yourself:
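A minimal sketch of that check, assuming an s3a:// path and the standard RDD/DataFrame APIs (the bucket and key below are hypothetical placeholders):

    // Hypothetical bucket and key, shown only to illustrate checking partition counts.
    val rdd = sc.textFile("s3a://my-bucket/data/input.csv")
    println(s"RDD partitions: ${rdd.getNumPartitions}")

    val df = spark.read.csv("s3a://my-bucket/data/input.csv")
    println(s"DataFrame partitions: ${df.rdd.getNumPartitions}")

Comparing these counts against the file size and your split-size settings shows which value (block size or minimum split size) is actually driving the partitioning.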
For further reading I suggest this, which covers partitioning rules in detail.