In my job, I need to parse many historical log sets. Individual customers (there are thousands of them) may have hundreds of log subdirectories broken out by date. For example:
- logs/Customer_One/2011-01-02-001
- logs/Customer_One/2012-02-03-001
- logs/Customer_One/2012-02-03-002
- logs/Customer_Two/2009-03-03-001
- logs/Customer_Two/2009-03-03-002
Each individual log set may itself be five or six levels deep and contain thousands of files.
Therefore, I actually want the individual map tasks to handle walking the subdirectories: simply enumerating the individual files is itself part of my distributed computing problem!
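For context, here is roughly the shape of mapper I have in mind. This is only a sketch, assuming each map task receives directory paths one per input line and that paths contain no whitespace; the emitted key/value format is purely illustrative:

#!/bin/bash
# mapper.sh (sketch): each input record is a directory path; the map task
# walks that directory itself rather than relying on Hadoop to enumerate files.
while read -r logdir; do
    # Enumerate every regular file under this dated log subdirectory.
    find "$logdir" -type f | while read -r logfile; do
        # Emit "filename<TAB>logline" so the reducer can aggregate per file;
        # the real parsing logic is omitted here.
        awk -v f="$logfile" '{ print f "\t" $0 }' "$logfile"
    done
done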
Unfortunately, when I try passing a directory that contains only log subdirectories to Hadoop, it refuses, complaining that the subdirectories are not files and so cannot be handed to my mapper. (Again, my mapper is written to accept subdirectories as input.)
$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" \
    -input file:///mnt/logs/Customer_Name/ \
    -file mapper.sh -mapper "mapper.sh" \
    -file reducer.sh -reducer "reducer.sh" \
    -output .
[ . . . ]
12/04/10 12:48:35 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.IOException: Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
12/04/10 12:48:35 ERROR streaming.StreamJob: Error Launching job : Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
Streaming Command Failed!
[cloudera@localhost ~]$
Is there a straightforward way to convince Hadoop Streaming to let me assign directories as work items?