I'm trying to launch a cluster and run a job all using boto.
I find lots of examples of creating job_flows. But I can't, for the life of me, find an example that shows:
- How to define the cluster to be used (by cluster_id)
- How to configure and launch a cluster (for example, if I want to use spot instances for some task nodes)
Am I missing something?
Boto and the underlying EMR API currently mix the terms cluster and job flow, and job flow is being deprecated. I treat them as synonyms.
You create a new cluster by calling the boto.emr.connection.EmrConnection.run_jobflow()
method. It returns the cluster ID that EMR generates for you.
First, the mandatory setup:
#!/usr/bin/env python
import boto
import boto.emr
from boto.emr.instance_group import InstanceGroup
conn = boto.emr.connect_to_region('us-east-1')
Then we specify instance groups, including the spot price we want to pay for the TASK nodes:
instance_groups = []
instance_groups.append(InstanceGroup(
    num_instances=1,
    role="MASTER",
    type="m1.small",
    market="ON_DEMAND",
    name="Main node"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="CORE",
    type="m1.small",
    market="ON_DEMAND",
    name="Worker nodes"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="TASK",
    type="m1.small",
    market="SPOT",
    name="My cheap spot nodes",
    bidprice="0.002"))
Finally we start a new cluster:
cluster_id = conn.run_jobflow(
    "Name for my cluster",
    instance_groups=instance_groups,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    enable_debugging=True,
    log_uri="s3://mybucket/logs/",
    hadoop_version=None,
    ami_version="2.4.9",
    steps=[],
    bootstrap_actions=[],
    ec2_keyname="my-ec2-key",
    visible_to_all_users=True,
    job_flow_role="EMR_EC2_DefaultRole",
    service_role="EMR_DefaultRole")
We can also print the cluster ID if we care about that:
print("Starting cluster %s" % cluster_id)
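To cover the first bullet of the question (using an existing cluster by its ID), here is a rough sketch of how a step could be attached to the cluster started above. It assumes the boto 2 calls describe_cluster() and add_jobflow_steps() on the connection object; the helper names, the jar path, and the step arguments are made up for illustration. As far as I know EMR queues steps on a cluster that is still starting, so the wait is optional.

```python
import time

# States in which an EMR cluster will never accept new steps.
FAILED_STATES = ("TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS")


def wait_until_waiting(conn, cluster_id, poll_seconds=30, sleep=time.sleep):
    """Poll until the cluster is idle (WAITING) and return that state.

    `conn` is assumed to be a boto 2 EmrConnection. `sleep` is injectable
    so the loop can be exercised without real delays.
    """
    while True:
        state = conn.describe_cluster(cluster_id).status.state
        if state == "WAITING":
            return state
        if state in FAILED_STATES:
            raise RuntimeError(
                "Cluster %s ended in state %s" % (cluster_id, state))
        sleep(poll_seconds)


def submit_step(conn, cluster_id, step):
    """Attach one step to an already running cluster, addressed by its ID."""
    return conn.add_jobflow_steps(cluster_id, [step])


def example(cluster_id):
    """Hypothetical usage; the jar path and arguments are placeholders."""
    import boto.emr
    from boto.emr.step import JarStep

    conn = boto.emr.connect_to_region("us-east-1")
    wait_until_waiting(conn, cluster_id)
    step = JarStep(
        name="My jar step",
        jar="s3://mybucket/jars/my-job.jar",
        step_args=["s3://mybucket/input/", "s3://mybucket/output/"],
        action_on_failure="CONTINUE")
    return submit_step(conn, cluster_id, step)
```

Calling example(cluster_id) with the ID returned by run_jobflow() would then run the step on that cluster instead of launching a new one.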
I believe the minimum amount of Python that will launch an EMR cluster with boto3 is:
import boto3
client = boto3.client('emr', region_name='us-east-1')
response = client.run_job_flow(
    Name="Boto3 test cluster",
    ReleaseLabel='emr-5.12.0',
    Instances={
        'MasterInstanceType': 'm4.xlarge',
        'SlaveInstanceType': 'm4.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
        'Ec2SubnetId': 'my-subnet-id',
        'Ec2KeyName': 'my-key',
    },
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole'
)
Notes: you'll have to create EMR_EC2_DefaultRole and EMR_DefaultRole. The Amazon documentation claims that JobFlowRole and ServiceRole are optional, but omitting them did not work for me. That could be because my subnet is a VPC subnet, but I'm not sure.
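If you want spot instances for the task nodes with boto3, my understanding is that you replace the MasterInstanceType / SlaveInstanceType / InstanceCount keys with an InstanceGroups list. The sketch below is untested against a real account: the helper names, the bid price, and the step arguments are placeholders I made up, and it assumes a release (such as emr-5.12.0) where command-runner.jar is available. It also shows how a step might be sent to an existing cluster by its ID.

```python
def build_instances(task_bid_price, subnet_id, key_name):
    """Return the Instances argument for run_job_flow with SPOT task nodes."""
    def group(name, role, market, count, **extra):
        spec = {"Name": name, "InstanceRole": role, "Market": market,
                "InstanceType": "m4.xlarge", "InstanceCount": count}
        spec.update(extra)
        return spec

    return {
        "InstanceGroups": [
            group("Main node", "MASTER", "ON_DEMAND", 1),
            group("Worker nodes", "CORE", "ON_DEMAND", 2),
            # BidPrice is a string (USD per instance-hour), only for SPOT.
            group("My cheap spot nodes", "TASK", "SPOT", 2,
                  BidPrice=task_bid_price),
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
        "Ec2SubnetId": subnet_id,
        "Ec2KeyName": key_name,
    }


def launch(client, instances, name="Boto3 spot test cluster"):
    """Start a cluster and return the ID that EMR generates for it."""
    response = client.run_job_flow(
        Name=name,
        ReleaseLabel="emr-5.12.0",
        Instances=instances,
        VisibleToAllUsers=True,
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole")
    return response["JobFlowId"]


def add_step(client, cluster_id, name, args):
    """Submit one step to an existing cluster, addressed by its ID."""
    response = client.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[{
            "Name": name,
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {"Jar": "command-runner.jar",
                              "Args": list(args)},
        }])
    return response["StepIds"][0]


def main():
    """Hypothetical usage; needs AWS credentials and real resource names."""
    import boto3
    client = boto3.client("emr", region_name="us-east-1")
    instances = build_instances("0.10", "my-subnet-id", "my-key")
    cluster_id = launch(client, instances)
    add_step(client, cluster_id, "My step",
             ["spark-submit", "s3://mybucket/jobs/my_job.py"])
```

The ID returned by launch() is the same value that add_step() takes as JobFlowId, so you can also pass in the ID of a cluster that was started earlier.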