Setting Hadoop parameters with boto?

Posted 2019-06-27 06:37

I am trying to enable bad-input skipping for my Amazon Elastic MapReduce jobs. I am following the excellent recipe described here:

http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code

The link above says that I need to somehow set the following configuration parameters on an EMR job:

mapred.skip.mode.enabled=true
mapred.skip.map.max.skip.records=1
mapred.skip.attempts.to.start.skipping=2
mapred.map.tasks=1000
mapred.map.max.attempts=10

How do I set these (and other) mapred.XXX parameters on a JobFlow using boto?

Answer 1:

After many hours of struggling, reading code, and experimenting, here is the answer:

You need to add a new BootstrapAction, like this:

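from boto.emr.bootstrap_action import BootstrapAction
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

# Each '-s' flag passes one key=value Hadoop setting to the configure-hadoop script.
params = ['-s', 'mapred.skip.mode.enabled=true',
          '-s', 'mapred.skip.map.max.skip.records=1',
          '-s', 'mapred.skip.attempts.to.start.skipping=2',
          '-s', 'mapred.map.max.attempts=5',
          '-s', 'mapred.task.timeout=100000']
config_bootstrapper = BootstrapAction('Enable skip mode', 's3://elasticmapreduce/bootstrap-actions/configure-hadoop', params)

# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are your own credentials, defined elsewhere.
conn = EmrConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
step = StreamingStep(name='My Step', ...)
conn.run_jobflow(..., bootstrap_actions=[config_bootstrapper], steps=[step], ...)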

Of course, if you have more than one bootstrap action, just add this one to the bootstrap_actions list alongside the others.
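
If the list of settings grows, writing the alternating '-s' / 'key=value' pairs by hand gets error-prone. The following is a minimal sketch, not part of boto, of one way to build the same params list from a mapping. The helper name build_configure_hadoop_args and the skip_settings variable are hypothetical names chosen for illustration; the '-s' flag and the setting values are taken from the answer above.

```python
from collections import OrderedDict


def build_configure_hadoop_args(settings, flag="-s"):
    """Flatten {key: value} settings into [flag, 'key=value', flag, ...]."""
    args = []
    for key, value in settings.items():
        # Hadoop expects lowercase 'true'/'false', not Python's 'True'/'False'.
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.extend([flag, "{0}={1}".format(key, value)])
    return args


# OrderedDict keeps the flags in a predictable order on older Python versions.
skip_settings = OrderedDict([
    ("mapred.skip.mode.enabled", True),
    ("mapred.skip.map.max.skip.records", 1),
    ("mapred.skip.attempts.to.start.skipping", 2),
    ("mapred.map.max.attempts", 5),
    ("mapred.task.timeout", 100000),
])

params = build_configure_hadoop_args(skip_settings)
```

The resulting params list can then be passed to BootstrapAction exactly as in the answer.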
