I am trying to determine how many nodes I need for my EMR cluster. As part of best practices the recommendations are:
(Total Mappers needed for your job + Time taken to process) / (per instance capacity + desired time) as outlined here: http://www.slideshare.net/AmazonWebServices/amazon-elastic-mapreduce-deep-dive-and-best-practices-bdt404-aws-reinvent-2013, page 89.
The question is how to determine how many parallel mappers the instance will support since AWS don't publish? https://aws.amazon.com/emr/pricing/
Sorry if i missed something obvious.
Wayne
To determine the number of parallel mappers , you will need to check this documentation from EMR called Task Configuration where EMR had a predefined mapping set of configurations for every instance type which would determine the number of mappers/reducers.
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html
For example : Lets say you have 5 m1.xlarge core nodes. According to the default mapred-site.xml configuration values for that instance type from EMR docs, we have
You can simply divide the later with former setting to get the maximum number of mappers supported by one m1.xlarge node
= (12288/768) = 16
So, for the 5 node cluster , it would a max of
16*5 = 80
mappers that can run in parallel (considering a map only job). The same is the case with max parallel Reducers(30). You can do similar math for a combination of mappers and reducers.So, If you want to run more mappers in parallel , you can either
re-size
the cluster or reduce themapreduce.map.memory.mb
(and its heapmapreduce.map.java.opts
) on every node and restart NM toTo understand what the above mapred-site.xml properties mean and why you do need to do those calculations , you can refer it here : https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Note : The above calculations and statements are true if EMR stays in its default configuration using
YARN capacity scheduler
withDefaultResourceCalculator
. If for example , you configure your capacity scheduler to useDominantResourceCalculator
, it will consider VCPU's + Memory on every nodes (not just memory's) to decide on parallel number of mappers.