S3 and EMR data locality

Posted 2020-06-03 02:59

Question:

Data locality is very important with MapReduce and HDFS (the same goes for Spark and HBase). I've been researching AWS and the two options for deploying a cluster in their cloud:

  • EC2
  • EMR + S3

The second option seems more appealing for several reasons, the most interesting being the ability to scale storage and processing separately and to shut down processing when you don't need it (more precisely, to turn it on only when needed). This is an example explaining the advantages of using S3.

What bugs me is the issue of data locality. If the data is stored in S3, it will need to be pulled to HDFS every time a job is run. My question is: how big can this issue be, and is it still worth it?

What comforts me is the fact that I'll be pulling the data only the first time, and all subsequent jobs will have the intermediate results locally.

I'm hoping for an answer from someone with practical experience with this. Thank you.

Answer 1:

EMR does not pull data from S3 to HDFS. It uses EMRFS, its own implementation of the Hadoop file system interface on top of S3, so jobs read and write S3 directly as if they were operating on an actual HDFS. See https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
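
To illustrate reading S3 directly without a copy step, here is a minimal PySpark sketch. It assumes the job runs on an EMR cluster where the `s3://` scheme is handled by EMRFS and that the data is stored as Parquet. The bucket name, prefix, and the helper names `s3_uri`, `count_rows`, and `main` are hypothetical and only serve the illustration.

```python
def s3_uri(bucket: str, key: str) -> str:
    """Build an s3:// URI (the scheme assumed to be served by EMRFS on EMR)."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def count_rows(spark, uri: str) -> int:
    """Read Parquet data straight from the given URI and count its rows.

    There is no explicit copy to HDFS here: Spark reads through the
    file system registered for the URI scheme.
    """
    return spark.read.parquet(uri).count()


def main() -> None:
    # pyspark is imported lazily so the helpers above work without it installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("emrfs-read-sketch").getOrCreate()
    # Hypothetical bucket and prefix; replace with your own.
    uri = s3_uri("my-example-bucket", "events/2020/06/")
    print(count_rows(spark, uri))
    spark.stop()

# Call main() when this file is submitted to the cluster with spark-submit.
```

The same `spark.read` call works unchanged against an `hdfs://` path, which is why the job code does not need to care where the data lives.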

As for data locality, S3 is RACK_LOCAL to EMR Spark clusters.