I have implemented a task in Hive. Currently it is working fine on my single node cluster.
Now I am planning to deploy it on AWS.
I don't know anything about AWS. If I deploy it there, which should I choose: Amazon EC2 or Amazon EMR?
I want to improve the performance of my task. Which one is better and more reliable for my case, and how should I get started with it? I have also heard that we can register an existing VM image as-is on AWS. Is that possible?
Any suggestions would be appreciated.
Many Thanks.
EMR is a collection of EC2 instances with Hadoop (and optionally Hive and/or Pig) installed and configured on them. If you are using your cluster for running Hadoop/Hive/Pig jobs, EMR is the way to go. An EMR instance costs a little extra compared to a plain EC2 instance. A quick check of Amazon's prices today shows that a small EC2 instance costs $0.08/hour, while a small EMR instance costs an extra $0.015/hour on top of that.
In my opinion, it's totally worth paying that extra money to save yourself the hassle of installing and setting up Hadoop (along with Hive and Pig), and of creating, maintaining, and using an AMI. Moreover, EMR's versions of Hadoop and Hive have some patches that are not available (at least, not yet) in Apache Hive. If you use EC2, you will probably be using Apache Hadoop and Hive (or maybe the Cloudera distributions) and wouldn't have access to those patches (such as native support for S3, or commands like ALTER TABLE my_table RECOVER PARTITIONS).
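To put the price difference in perspective, here is a rough sketch of the arithmetic in Python. It uses only the two hourly figures quoted above; the node count and run time are hypothetical values chosen for illustration, and actual AWS prices vary by region, instance type, and date.

```python
# Hourly prices quoted in this answer (USD). Treat them as a snapshot,
# not current pricing.
EC2_SMALL_PER_HOUR = 0.08
EMR_SURCHARGE_PER_HOUR = 0.015


def cluster_cost(nodes: int, hours: float, use_emr: bool) -> float:
    """Estimate the cost in USD of running `nodes` small instances for `hours` hours."""
    rate = EC2_SMALL_PER_HOUR
    if use_emr:
        # EMR is billed as the EC2 price plus a per-instance surcharge.
        rate += EMR_SURCHARGE_PER_HOUR
    return round(nodes * hours * rate, 2)


if __name__ == "__main__":
    nodes, hours = 4, 10  # hypothetical cluster size and job duration
    ec2 = cluster_cost(nodes, hours, use_emr=False)
    emr = cluster_cost(nodes, hours, use_emr=True)
    print(f"EC2 only:    ${ec2:.2f}")
    print(f"EMR:         ${emr:.2f}")
    print(f"EMR premium: ${emr - ec2:.2f}")
```

With these quoted prices the EMR surcharge works out to a bit under 20% on top of the EC2 cost, regardless of cluster size.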
References:
- http://aws.amazon.com/ec2/pricing/
- http://aws.amazon.com/elasticmapreduce/pricing/
I would suggest that you do NOT try to deploy your own Hadoop cluster unless you have 2-3 months to spare and a Hadoop expert handy.
Elastic MapReduce will let you get started very quickly by providing a pre-configured Hadoop environment. Seeing as you only have a single job, it should be fine.
Historically, EMR has been pretty far behind the latest versions of Hadoop components, and some components were missing entirely. That's the major reason for using another distribution. For example, if you wanted HBase, it wasn't in EMR, but now it is. Today, Spark is absent from EMR. EMR will generally lag.
That said, if you're not using the latest and greatest features, go with EMR.