Hadoop Streaming Command Failure with Python Error

Posted 2019-02-18 15:47

I'm a newcomer to Ubuntu, Hadoop, and DFS, but I've managed to install a single-node Hadoop instance on my local Ubuntu machine following the directions posted on Michael-Noll.com here:

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#copy-local-example-data-to-hdfs

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

I'm currently stuck on running the basic word count example on Hadoop. I'm not sure whether running Hadoop out of my Downloads directory makes much of a difference, but I've attempted to tweak the file locations for my mapper.py and reducer.py scripts by placing them in the Hadoop working directory, with no success. I've exhausted my research and still cannot solve this problem (e.g., using -file parameters, etc.). I really appreciate any help in advance, and I hope I framed this question in a way that can help others who are just beginning with Python + Hadoop.

I tested mapper.py and reducer.py independently, and both work fine when fed toy text data from the bash shell.
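
For reference, both scripts follow the word-count pattern from the tutorial linked above; a minimal sketch of that pattern (not my exact files) looks roughly like this:

mapper.py:

#!/usr/bin/env python
# emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write('%s\t%s\n' % (word, 1))

reducer.py:

#!/usr/bin/env python
# sum the counts per word (streaming sorts the mapper output by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            sys.stdout.write('%s\t%s\n' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    sys.stdout.write('%s\t%s\n' % (current_word, current_count))

Both scripts can be exercised locally with a pipe along the lines of cat sometext.txt | ./mapper.py | sort | ./reducer.py.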

Output from my Bash Shell:

hduser@chris-linux:/home/chris/Downloads/hadoop$ bin/hadoop jar /home/chris/Downloads/hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar -file mapper.py -file reducer.py -mapper mapper.py -reducer reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output3
Warning: $HADOOP_HOME is deprecated.

packageJobJar: [mapper.py, reducer.py, /app/hadoop/tmp/hadoop-unjar4681300115516015516/] [] /tmp/streamjob2215860242221125845.jar tmpDir=null
13/03/08 14:43:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/08 14:43:46 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/08 14:43:46 INFO mapred.FileInputFormat: Total input paths to process : 3
13/03/08 14:43:47 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
13/03/08 14:43:47 INFO streaming.StreamJob: Running job: job_201303081155_0032
13/03/08 14:43:47 INFO streaming.StreamJob: To kill this job, run:
13/03/08 14:43:47 INFO streaming.StreamJob: /home/chris/Downloads/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201303081155_0032
13/03/08 14:43:47 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201303081155_0032
13/03/08 14:43:48 INFO streaming.StreamJob:  map 0%  reduce 0%
13/03/08 14:44:12 INFO streaming.StreamJob:  map 100%  reduce 100%
13/03/08 14:44:12 INFO streaming.StreamJob: To kill this job, run:
13/03/08 14:44:12 INFO streaming.StreamJob: /home/chris/Downloads/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201303081155_0032
13/03/08 14:44:12 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201303081155_0032
13/03/08 14:44:12 ERROR streaming.StreamJob: Job not successful. Error: JobCleanup Task Failure, Task: task_201303081155_0032_m_000003
13/03/08 14:44:12 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

My HDFS is located at /app/hadoop/tmp, which, I believe, is also the same as my /user/hduser directory on my Hadoop instance.

The input data is located at /user/hduser/gutenberg/* (3 UTF-8 plain text files), and the output is set to be created at /user/hduser/gutenberg-output.

3 Answers
在下西门庆 · 2019-02-18 16:10

Similar to the errors I was getting:


First, regarding: -file mapper.py -file reducer.py -mapper mapper.py -reducer reducer.py

You can use fully qualified local filesystem paths with '-file', and then just the relative name with '-mapper', e.g.: -file /aFully/qualified/localSystemPathTo/yourMapper.py -mapper yourMapper.py


Then: remember to include the shebang "#!/usr/bin/python" at the top of both 'mapper.py' and 'reducer.py'.


Finally,

in my mapper.py and reducer.py, I put all my imports inside a 'setup_call()' function (rather than at the file's 'global' level), and then wrapped that with:

if __name__ == '__main__':
    try:
        setup_call()   # imports live in here, along with the actual map/reduce logic
    except:
        import sys, traceback, StringIO

        # capture the traceback into a string buffer and push it to stderr,
        # where Hadoop's task logs will pick it up
        fakeWriteable = StringIO.StringIO()
        traceback.print_exc(None, file=fakeWriteable)

        msg = ""
        msg += "------------------------------------------------------\n"
        msg += "----theTraceback: -----------\n"
        msg += fakeWriteable.getvalue() + "\n"
        msg += "------------------------------------------------------\n"

        sys.stderr.write(msg)

At that point, I was able to use the Hadoop web job log (those http:// links in your error output) and navigate my way to the 'stderr' messages from the actual core logic.


I'm sure there are more concise ways to do all this, but this was both semantically clear and sufficient for my immediate needs.
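
For example, a more concise variant of the same wrapper (just a sketch; setup_call() stands in for whatever your own entry point is) would be:

if __name__ == '__main__':
    try:
        setup_call()
    except Exception:
        import sys, traceback
        # print_exc writes the traceback straight to stderr, which ends up
        # in the task attempt's stderr log visible from the job tracker UI
        traceback.print_exc(file=sys.stderr)
        # exit non-zero so streaming marks the attempt as failed
        sys.exit(1)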

Good luck.

叛逆 · 2019-02-18 16:13

Have a look at the logs in the following path (based on the information supplied above):

$HADOOP_HOME/logs/userlogs/job_201303081155_0032/task_201303081155_0032_m_000003

This should provide you with some information on that specific task.

The logs supplied by Hadoop are pretty good; it just takes some digging around to find the information :)

欢心 · 2019-02-18 16:29

Sorry for the late response.

You should make sure your files (mapper and reducer) are executable by the hadoop user and contain the shebang in the first line.

That will solve your problem.
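
If you want a quick way to sanity-check both conditions, something along these lines works (a throwaway helper I'm sketching here, not part of Hadoop; the script name and defaults are made up):

#!/usr/bin/env python
# check_streaming_scripts.py -- warn if a streaming script is missing its
# shebang line or the executable bit (either one can make streaming tasks fail)
import os
import sys

for path in sys.argv[1:] or ['mapper.py', 'reducer.py']:
    with open(path) as f:
        first_line = f.readline()
    if not first_line.startswith('#!'):
        sys.stderr.write('%s: missing shebang on the first line\n' % path)
    if not os.access(path, os.X_OK):
        sys.stderr.write('%s: not executable (try: chmod +x %s)\n' % (path, path))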
