Amazon EC2 vs. Amazon EMR [closed]

Posted 2020-02-03 05:07

女痞

Question:

Closed. This question is opinion-based. It is not currently accepting answers.

Closed 5 years ago.

I have implemented a task in Hive. It currently works fine on my single-node cluster, and I am now planning to deploy it on AWS.

I don't know anything about AWS. If I deploy there, which should I choose: Amazon EC2 or Amazon EMR?

I want to improve the performance of my task. Which one is better and more reliable for my case, and how should I approach each of them? I have also heard that we can register our existing VM setup as-is on AWS. Is that possible?

Please advise as soon as possible.

Many thanks.

Answer 1:

EMR is a collection of EC2 instances with Hadoop (and optionally Hive and/or Pig) installed and configured on them. If you are using your cluster to run Hadoop/Hive/Pig jobs, EMR is the way to go. An EMR instance costs a little extra compared to an EC2 instance. A quick check of Amazon's prices today shows that a small EC2 instance costs $0.08/hour, while a small EMR instance costs an extra $0.015/hour. In my opinion, it's totally worth paying that extra money to save yourself the hassle of installing and setting up Hadoop (along with Hive and Pig), and of creating, maintaining, and using an AMI. Moreover, EMR's versions of Hadoop and Hive include some patches that are not available (at least, not yet) in Apache Hive. If you use EC2, you will probably be using Apache Hadoop and Hive (or maybe the Cloudera distributions) and won't have access to those patches, such as native support for S3 or commands like ALTER TABLE my_table RECOVER PARTITIONS.

References:

http://aws.amazon.com/ec2/pricing/
http://aws.amazon.com/elasticmapreduce/pricing/
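
To put the per-hour figures quoted in this answer in perspective, here is a minimal Python sketch of the arithmetic. The two prices are the ones cited above and are historical; the node count and running time are purely hypothetical, and `cluster_cost` is an illustrative helper, not part of any AWS tooling.

```python
# Hourly prices as quoted in the answer (historical; check the pricing
# pages in the references for current values).
EC2_SMALL_PER_HOUR = 0.08       # USD per instance-hour, small EC2 instance
EMR_SURCHARGE_PER_HOUR = 0.015  # USD per instance-hour added on top by EMR


def cluster_cost(nodes: int, hours: float, use_emr: bool) -> float:
    """Total cost in USD for `nodes` small instances running for `hours`."""
    rate = EC2_SMALL_PER_HOUR + (EMR_SURCHARGE_PER_HOUR if use_emr else 0.0)
    return nodes * hours * rate


if __name__ == "__main__":
    # Hypothetical workload: 5 nodes running for 10 hours.
    ec2_only = cluster_cost(5, 10, use_emr=False)
    with_emr = cluster_cost(5, 10, use_emr=True)
    print(f"EC2 only: ${ec2_only:.2f}")
    print(f"With EMR: ${with_emr:.2f}")
    print(f"Premium:  ${with_emr - ec2_only:.2f}")
```

Under these assumptions the EMR premium is the surcharge times the instance-hours, which is the amount to weigh against the setup and maintenance effort described above.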

Answer 2:

I would suggest that you do NOT try to deploy your own Hadoop cluster unless you have 2-3 months to spare and a Hadoop expert handy.

Elastic MapReduce will let you get started very quickly by providing a pre-configured Hadoop environment. Seeing as you only have a single job, it should be fine.
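
As a rough illustration of what "pre-configured" can look like in practice, the Python sketch below builds a request for a short-lived EMR cluster that runs one Hive script and then shuts down. It assumes the `run_job_flow` call of boto3's EMR client; the release label, instance types, role names, and S3 paths are all assumptions or placeholders rather than values from this thread, and `build_hive_job_request` is a hypothetical helper.

```python
def build_hive_job_request(script_s3_uri: str, log_s3_uri: str,
                           node_count: int = 3) -> dict:
    """Build keyword arguments for a one-off Hive job on EMR.

    All concrete values here are illustrative assumptions; adjust them
    to your own account and region before use.
    """
    return {
        "Name": "hive-task",                  # hypothetical cluster name
        "LogUri": log_s3_uri,
        "ReleaseLabel": "emr-6.15.0",         # assumed release label
        "Applications": [{"Name": "Hive"}],   # ask EMR to install Hive
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": node_count,
            # Terminate the cluster once the step below has finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "run-hive-script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


# Placeholder S3 locations; replace with your own bucket.
request = build_hive_job_request("s3://my-bucket/task.q", "s3://my-bucket/logs/")
# With AWS credentials configured, this could then be submitted as:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

The sketch only constructs the request, so it can be inspected without an AWS account; nothing is launched until the commented-out call is made.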

Answer 3:

In general, EMR has historically been pretty far behind the latest versions of the Hadoop components, and some were missing entirely. That's the major reason for using another distribution. For example, if you wanted HBase, it wasn't in EMR, but now it is. Today, Spark is absent from EMR. EMR will generally lag.

That said, if you're not using the latest and greatest features, go with EMR.