I'm a newcomer to Ubuntu, Hadoop, and DFS, but I've managed to install a single-node Hadoop instance on my local Ubuntu machine by following the directions posted on Michael-Noll.com here:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#copy-local-example-data-to-hdfs
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
I'm currently stuck on running the basic word-count example on Hadoop. I'm not sure whether running Hadoop out of my Downloads directory makes much of a difference, but I've attempted to tweak the file locations for my mapper.py and reducer.py functions by placing them in the Hadoop working directory, with no success. I've exhausted all of my research (e.g. using -file parameters, etc.) and still cannot solve this problem. I really appreciate any help in advance, and I hope I've framed this question in a way that can help others who are just beginning with Python + Hadoop.
I tested mapper.py and reducer.py independently, and both work fine when fed toy text data from the bash shell.
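For reference, both scripts follow the word-count pattern from the second tutorial above. Trimmed down, they look roughly like this (note the shebang lines, since streaming executes the files directly):

mapper.py:

#!/usr/bin/env python
import sys

# emit one "word<TAB>1" pair per token on stdin
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

reducer.py:

#!/usr/bin/env python
import sys

# stdin arrives sorted by key, so all counts for a word are contiguous
current_word = None
current_count = 0

for line in sys.stdin:
    parts = line.strip().split('\t', 1)
    if len(parts) != 2:
        continue  # skip malformed lines
    word, count = parts
    try:
        count = int(count)
    except ValueError:
        continue
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count

if current_word is not None:
    print('%s\t%s' % (current_word, current_count))

The local test was along the lines of echo "foo foo quux labs foo" | python mapper.py | sort -k1,1 | python reducer.py, which prints the expected counts.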
Output from my Bash Shell:
hduser@chris-linux:/home/chris/Downloads/hadoop$ bin/hadoop jar /home/chris/Downloads/hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar -file mapper.py -file reducer.py -mapper mapper.py -reducer reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output3
Warning: $HADOOP_HOME is deprecated.
packageJobJar: [mapper.py, reducer.py, /app/hadoop/tmp/hadoop-unjar4681300115516015516/] [] /tmp/streamjob2215860242221125845.jar tmpDir=null
13/03/08 14:43:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/08 14:43:46 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/08 14:43:46 INFO mapred.FileInputFormat: Total input paths to process : 3
13/03/08 14:43:47 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
13/03/08 14:43:47 INFO streaming.StreamJob: Running job: job_201303081155_0032
13/03/08 14:43:47 INFO streaming.StreamJob: To kill this job, run:
13/03/08 14:43:47 INFO streaming.StreamJob: /home/chris/Downloads/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201303081155_0032
13/03/08 14:43:47 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201303081155_0032
13/03/08 14:43:48 INFO streaming.StreamJob: map 0% reduce 0%
13/03/08 14:44:12 INFO streaming.StreamJob: map 100% reduce 100%
13/03/08 14:44:12 INFO streaming.StreamJob: To kill this job, run:
13/03/08 14:44:12 INFO streaming.StreamJob: /home/chris/Downloads/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201303081155_0032
13/03/08 14:44:12 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201303081155_0032
13/03/08 14:44:12 ERROR streaming.StreamJob: Job not successful. Error: JobCleanup Task Failure, Task: task_201303081155_0032_m_000003
13/03/08 14:44:12 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
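For completeness, the variants I mentioned above look like this, e.g. invoking the scripts through the interpreter explicitly instead of relying on the shebang lines (with a fresh output directory each run, since Hadoop refuses to write into an existing one):

bin/hadoop jar /home/chris/Downloads/hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar -file mapper.py -file reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output4

These fail the same way.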
My HDFS storage is located at /app/hadoop/tmp (the hadoop.tmp.dir from the tutorial), which, I believe, is what backs the /user/hduser directory on my Hadoop instance.
Input data is located at /user/hduser/gutenberg/* (three UTF-8 plain-text files). Output is set to be created at /user/hduser/gutenberg-output (gutenberg-output3 in the run above).
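The input files do show up when I list that directory from the Hadoop shell, which matches the "Total input paths to process : 3" line in the job output above:

bin/hadoop fs -ls /user/hduser/gutenberg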