I'm trying to launch a cluster and run a job, all using boto. I find lots of examples of creating job_flows, but I can't, for the life of me, find an example that shows:
- How to define the cluster to be used (by cluster_id)
- How to configure and launch a cluster (for example, if I want to use spot instances for some task nodes)
Am I missing something?
Boto and the underlying EMR API currently mix the terms cluster and job flow, and job flow is being deprecated. I consider them synonyms.
You create a new cluster by calling the boto.emr.connection.run_jobflow() function. It will return the cluster ID which EMR generates for you.
First, all the mandatory things:
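The original snippet for this step is not shown here, so the following is only a sketch of what the setup might look like, assuming boto 2 is installed and AWS credentials are already configured (environment variables or a boto config file). The region and the helper name open_emr_connection are placeholders, not part of the original answer.

```python
# Placeholder region; use whichever region your cluster should run in.
REGION = "us-east-1"


def open_emr_connection(region_name=REGION):
    """Return a boto 2 EMR connection for the given region."""
    # Imported lazily so this module still loads where boto is not installed.
    import boto.emr

    return boto.emr.connect_to_region(region_name)
```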
Then we specify instance groups, including the spot price we want to pay for the TASK nodes:
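As a rough sketch of that step, the groups can be described as plain keyword arguments and turned into boto 2 InstanceGroup objects. The instance counts, instance types, group names and the bid price below are made-up placeholders, and build_instance_groups is a hypothetical helper name.

```python
# Hypothetical sizes, instance types and bid price; adjust to your workload.
GROUP_SPECS = [
    dict(num_instances=1, role="MASTER", type="m1.small",
         market="ON_DEMAND", name="Main node"),
    dict(num_instances=2, role="CORE", type="m1.small",
         market="ON_DEMAND", name="Worker nodes"),
    # Only the TASK group uses the spot market, so only it carries a bid price.
    dict(num_instances=2, role="TASK", type="m1.small",
         market="SPOT", name="Spot worker nodes", bidprice="0.002"),
]


def build_instance_groups(specs=GROUP_SPECS):
    """Convert the plain specs into boto 2 InstanceGroup objects."""
    # Imported lazily so this module still loads where boto is not installed.
    from boto.emr.instance_group import InstanceGroup

    return [InstanceGroup(**spec) for spec in specs]
```

InstanceGroup requires bidprice when market is "SPOT", which is why the price lives next to the TASK group rather than in a cluster-wide setting.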
Finally we start a new cluster:
We can also print the cluster ID if we care about that:
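The launch and the printout could look roughly like this. It is a sketch, not the original answer's code: launch_cluster is a hypothetical helper, the cluster name, log bucket and key pair name are placeholders, and the two role names are assumed to already exist in the account.

```python
def launch_cluster(conn, instance_groups, name="My cluster",
                   log_uri="s3://my-bucket/logs/", ec2_keyname="my-ec2-key"):
    """Start a cluster through a boto 2 EmrConnection and return its ID."""
    cluster_id = conn.run_jobflow(
        name,
        instance_groups=instance_groups,
        log_uri=log_uri,                    # placeholder bucket
        ec2_keyname=ec2_keyname,            # placeholder key pair name
        keep_alive=True,                    # keep running with no steps queued
        visible_to_all_users=True,
        job_flow_role="EMR_EC2_DefaultRole",
        service_role="EMR_DefaultRole",
    )
    return cluster_id


# Usage, given a connection from boto.emr.connect_to_region(...) and a list
# of boto.emr.instance_group.InstanceGroup objects:
#     print(launch_cluster(conn, instance_groups))
```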
I believe the minimum amount of Python that will launch an EMR cluster with boto3 is:
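The original snippet is not shown here, so what follows is a sketch of a small boto3 launch rather than the exact code. The cluster name, release label, instance types, instance count, region and subnet ID are placeholders, and build_request and launch are hypothetical helper names; the client call itself is boto3's run_job_flow.

```python
def build_request(subnet_id="subnet-XXXXXXXX"):
    """Assemble keyword arguments for the EMR client's run_job_flow call."""
    return {
        "Name": "My cluster",                     # placeholder
        "ReleaseLabel": "emr-5.36.0",             # placeholder release
        "Instances": {
            "MasterInstanceType": "m4.large",     # placeholder types
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
            "Ec2SubnetId": subnet_id,             # placeholder VPC subnet
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(region_name="us-east-1"):
    """Start the cluster and return the ID that EMR assigns to it."""
    # Imported lazily so this module still loads where boto3 is not installed.
    import boto3

    client = boto3.client("emr", region_name=region_name)
    response = client.run_job_flow(**build_request())
    return response["JobFlowId"]
```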
Notes: you'll have to create EMR_EC2_DefaultRole and EMR_DefaultRole. The Amazon documentation claims that JobFlowRole and ServiceRole are optional, but omitting them did not work for me. That could be because my subnet is a VPC subnet, but I'm not sure.