How to launch and configure an EMR cluster using boto

Posted 2019-02-03 06:14

Question:

I'm trying to launch a cluster and run a job, all using boto. I find lots of examples of creating job_flows, but I can't for the life of me find an example that shows:

  1. How to define the cluster to be used (by cluster_id)
  2. How to configure and launch a cluster (for example, if I want to use spot instances for some task nodes)

Am I missing something?

Answer 1:

Boto and the underlying EMR API currently mix the terms cluster and job flow, and job flow is being deprecated. I consider them synonyms.

You create a new cluster by calling the boto.emr.connection.run_jobflow() function. It will return the cluster ID which EMR generates for you.

First, all the mandatory things:

#!/usr/bin/env python

import boto
import boto.emr
from boto.emr.instance_group import InstanceGroup

conn = boto.emr.connect_to_region('us-east-1')

Then we specify instance groups, including the spot price we want to pay for the TASK nodes:

instance_groups = []
instance_groups.append(InstanceGroup(
    num_instances=1,
    role="MASTER",
    type="m1.small",
    market="ON_DEMAND",
    name="Main node"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="CORE",
    type="m1.small",
    market="ON_DEMAND",
    name="Worker nodes"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="TASK",
    type="m1.small",
    market="SPOT",
    name="My cheap spot nodes",
    bidprice="0.002"))

Finally we start a new cluster:

cluster_id = conn.run_jobflow(
    "Name for my cluster",
    instance_groups=instance_groups,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    enable_debugging=True,
    log_uri="s3://mybucket/logs/",
    hadoop_version=None,
    ami_version="2.4.9",
    steps=[],
    bootstrap_actions=[],
    ec2_keyname="my-ec2-key",
    visible_to_all_users=True,
    job_flow_role="EMR_EC2_DefaultRole",
    service_role="EMR_DefaultRole")

We can also print the cluster ID if we care about that:

print "Starting cluster", cluster_id


Answer 2:

I believe the minimum amount of Python that will launch an EMR cluster with boto3 is:

import boto3

client = boto3.client('emr', region_name='us-east-1')

response = client.run_job_flow(
    Name="Boto3 test cluster",
    ReleaseLabel='emr-5.12.0',
    Instances={
        'MasterInstanceType': 'm4.xlarge',
        'SlaveInstanceType': 'm4.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
        'Ec2SubnetId': 'my-subnet-id',
        'Ec2KeyName': 'my-key',
    },
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole'
)
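
The example above uses the MasterInstanceType/SlaveInstanceType shorthand. To get spot TASK nodes with boto3 (the second point in the question), the Instances argument can take an InstanceGroups list instead. A sketch reusing the client from above; the instance types, counts, and bid price are placeholders:

# Same call as above, but with explicit instance groups and a SPOT task group.
response = client.run_job_flow(
    Name="Boto3 spot test cluster",
    ReleaseLabel='emr-5.12.0',
    Instances={
        'InstanceGroups': [
            {'Name': 'Main node', 'Market': 'ON_DEMAND',
             'InstanceRole': 'MASTER', 'InstanceType': 'm4.xlarge', 'InstanceCount': 1},
            {'Name': 'Worker nodes', 'Market': 'ON_DEMAND',
             'InstanceRole': 'CORE', 'InstanceType': 'm4.xlarge', 'InstanceCount': 2},
            # Spot task nodes; the bid price is a placeholder (USD per instance-hour).
            {'Name': 'Cheap spot nodes', 'Market': 'SPOT', 'BidPrice': '0.10',
             'InstanceRole': 'TASK', 'InstanceType': 'm4.xlarge', 'InstanceCount': 2},
        ],
        'KeepJobFlowAliveWhenNoSteps': True,
        'Ec2SubnetId': 'my-subnet-id',
        'Ec2KeyName': 'my-key',
    },
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole'
)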

Notes: you'll have to create EMR_EC2_DefaultRole and EMR_DefaultRole. The Amazon documentation claims that JobFlowRole and ServiceRole are optional, but omitting them did not work for me. That could be because my subnet is a VPC subnet, but I'm not sure.
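
A quick way to create both roles is the AWS CLI command aws emr create-default-roles. If you just want to check whether they already exist before calling run_job_flow, a small boto3/IAM sketch (not part of the original answer) could look like this:

import boto3

# Check for the two default EMR roles; print a hint if either is missing.
iam = boto3.client('iam')
for role_name in ('EMR_EC2_DefaultRole', 'EMR_DefaultRole'):
    try:
        iam.get_role(RoleName=role_name)
        print(role_name, "exists")
    except iam.exceptions.NoSuchEntityException:
        print(role_name, "is missing; create it, e.g. with `aws emr create-default-roles`")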