Using S3 as fs.default.name, or HDFS?

Posted 2019-07-28 12:01

I'm setting up a Hadoop cluster on EC2 and I'm wondering how to set up the DFS. All my data is currently in S3, and all map/reduce applications use S3 file paths to access it. Now I've been looking at how Amazon's EMR is set up, and it appears that for each jobflow, a namenode and datanodes are set up. Now I'm wondering if I really need to do it that way, or if I could just use s3(n) as the DFS? If I do, are there any drawbacks?

Thanks!

4 Answers
叛逆
#2 · 2019-07-28 12:45

In order to use S3 instead of HDFS, fs.default.name in core-site.xml needs to point to your bucket:

<property>
        <name>fs.default.name</name>
        <value>s3n://your-bucket-name</value>
</property>

It's recommended that you use S3N and NOT the plain S3 block-based implementation, because S3N stores files as ordinary S3 objects that are readable by any other application, and by yourself :)

Also, in the same core-site.xml file you need to specify the following properties (see the sketch after this list):

  • fs.s3n.awsAccessKeyId
  • fs.s3n.awsSecretAccessKey

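A minimal sketch of those two properties in the same core-site.xml; the values are placeholders, not real credentials:

<!-- placeholder credentials; substitute your own AWS access key pair -->
<property>
        <name>fs.s3n.awsAccessKeyId</name>
        <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
        <name>fs.s3n.awsSecretAccessKey</name>
        <value>YOUR_SECRET_ACCESS_KEY</value>
</property>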

老娘就宠你
#3 · 2019-07-28 12:46

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/core-default.xml

fs.default.name is deprecated; fs.defaultFS is the current name for this property.
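For example, the equivalent setting with the non-deprecated key would look roughly like this (same placeholder bucket name as in the answer above):

<property>
        <name>fs.defaultFS</name>
        <value>s3n://your-bucket-name</value>
</property>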

放荡不羁爱自由
#4 · 2019-07-28 12:53

Any intermediate data from your job goes to HDFS, so yes, you still need a namenode and datanodes.
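One way to arrange this, sketched below with a placeholder namenode host and bucket name, is to keep the default filesystem on HDFS and pass s3n:// URIs directly as job input and output paths:

<!-- core-site.xml: keep HDFS as the default filesystem (placeholder host/port) -->
<property>
        <name>fs.defaultFS</name>
        <value>hdfs://your-namenode-host:8020</value>
</property>
<!-- jobs can still read and write S3 directly, e.g. an input path of
     s3n://your-bucket-name/input and an output path of
     s3n://your-bucket-name/output -->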

冷血范
#5 · 2019-07-28 12:56

I was able to get the S3 integration working using

<property>
        <name>fs.default.name</name>
        <value>s3n://your-bucket-name</value>
</property> 

in core-site.xml, and I could get the list of files with the hadoop fs -ls command. But you should also still have namenode and separate datanode configurations, because I was not sure how the data would otherwise get partitioned across the datanodes.

Should we have local storage for the namenode and datanodes?
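If you do keep a local HDFS alongside S3, a minimal hdfs-site.xml sketch for the namenode and datanode storage directories (the paths below are placeholders) would be:

<!-- hdfs-site.xml: placeholder local directories for namenode and datanode storage -->
<property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///mnt/hadoop/namenode</value>
</property>
<property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///mnt/hadoop/datanode</value>
</property>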
