Running an R script via Hadoop Streaming: job fails

Posted 2019-06-14 11:34

I have an R script that works perfectly fine in the R console, but when I run it through Hadoop Streaming it fails in the map phase with the error below. The task attempt logs follow.

The Hadoop Streaming command I am using:

/home/Bibhu/hadoop-0.20.2/bin/hadoop jar \
   /home/Bibhu/hadoop-0.20.2/contrib/streaming/*.jar \
   -input hdfs://localhost:54310/user/Bibhu/BookTE1.csv \
   -output outsid -mapper `pwd`/code1.sh

stderr logs

Loading required package: class
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  no lines available in input
Calls: read.csv -> read.table
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

syslog logs

2013-07-03 19:32:36,080 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2013-07-03 19:32:36,654 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2013-07-03 19:32:36,675 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2013-07-03 19:32:36,835 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2013-07-03 19:32:36,835 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2013-07-03 19:32:36,899 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/home/Bibhu/Downloads/SentimentAnalysis/Sid/smallFile/code1.sh]
2013-07-03 19:32:37,256 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=0/1
2013-07-03 19:32:38,509 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2013-07-03 19:32:38,509 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
2013-07-03 19:32:38,557 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
2013-07-03 19:32:38,631 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task

3 Answers
狗以群分 · 2019-06-14 11:50

You need to find the logs from your mappers and reducers, since that is where the job is failing (as indicated by java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1). The non-zero exit code means your R script crashed.

If you are using the Hortonworks Hadoop distribution, the easiest way is to open the jobhistory UI. It should be at http://127.0.0.1:19888/jobhistory . It should also be possible to find the logs on the filesystem from the command line, but I haven't yet found where.

  1. Open http://127.0.0.1:19888/jobhistory in your web browser
  2. Click on the Job ID of the failed job
  3. Click the number indicating the failed task count
  4. Click an attempt link
  5. Click the logs link

You should see a page that looks something like this:

Log Type: stderr
Log Length: 418
Traceback (most recent call last):
  File "/hadoop/yarn/local/usercache/root/appcache/application_1404203309115_0003/container_1404203309115_0003_01_000002/./mapper.py", line 45, in <module>
    mapper()
  File "/hadoop/yarn/local/usercache/root/appcache/application_1404203309115_0003/container_1404203309115_0003_01_000002/./mapper.py", line 37, in mapper
    for record in reader:
_csv.Error: newline inside string

This is an error from my Python script; errors from R look a bit different.

source: http://hortonworks.com/community/forums/topic/map-reduce-job-log-files/
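If log aggregation is enabled on the cluster, the same container logs can usually be pulled from the command line instead of the UI. A minimal sketch, assuming a YARN cluster and using the application ID visible in the log paths above:

    # Fetch the aggregated logs for the finished application and search them
    # (requires yarn.log-aggregation-enable=true in yarn-site.xml)
    yarn logs -applicationId application_1404203309115_0003 > app_logs.txt
    grep -B 2 -A 10 "Error" app_logs.txt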

2019-06-14 12:01
  1. Specify the Hadoop Streaming jar by its full versioned name, e.g. hadoop-streaming-1.0.4.jar, rather than a wildcard.
  2. Ship your mapper and reducer scripts to the cluster with the -file option.
  3. Tell Hadoop which scripts are your mapper and reducer with the -mapper and -reducer options.

For more background, see Running WordCount on Hadoop using R script.
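Putting the three points together, the original submission might look like the following sketch. The exact jar name under contrib/streaming depends on the release (hadoop-0.20.2-streaming.jar is an assumption for this install), and code1.sh is assumed to be an executable mapper script:

    # Name the streaming jar explicitly and ship the mapper with -file
    /home/Bibhu/hadoop-0.20.2/bin/hadoop jar \
       /home/Bibhu/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
       -input hdfs://localhost:54310/user/Bibhu/BookTE1.csv \
       -output outsid \
       -mapper code1.sh \
       -file `pwd`/code1.sh

Shipping code1.sh with -file copies it into each task's working directory, so the mapper no longer depends on a path that exists only on the submitting machine.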

Explosion°爆炸 · 2019-06-14 12:01

I ran into this same error tonight, also while developing MapReduce Streaming jobs with R.

I was working on a 10-node cluster, each node with 12 cores, and tried to supply the following at submission time:

-D mapred.map.tasks=200 \
-D mapred.reduce.tasks=200

The job completed successfully, though, when I changed these to:

-D mapred.map.tasks=10 \
-D mapred.reduce.tasks=10

This was a mysterious fix, and perhaps more context will arise this evening. But if any readers can elucidate, please do!
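One hedged guess at the mechanism: if some of the 200 requested map tasks end up with empty input splits, an R mapper that calls read.csv directly on stdin will die with exactly the "no lines available in input" error from the question. A minimal sketch of a mapper that tolerates empty input (an illustration, not the asker's code1.sh):

    #!/usr/bin/env Rscript
    # Read whatever Hadoop Streaming pipes in on stdin; an empty split
    # yields zero lines here instead of crashing read.csv.
    con <- file("stdin", open = "r")
    input_lines <- readLines(con)
    close(con)

    if (length(input_lines) > 0) {
      # header = FALSE because a split is an arbitrary chunk of the file
      records <- read.csv(text = input_lines, header = FALSE,
                          stringsAsFactors = FALSE)
      # Emit tab-separated key/value pairs, one per record
      for (i in seq_len(nrow(records))) {
        cat(records[i, 1], paste(unlist(records[i, -1]), collapse = ","),
            sep = "\t")
        cat("\n")
      }
    }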
