I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.
I can see the files I wish to search like this:
```
bash-3.00$ hadoop fs -ls /apps/mdhi-technology/b_dps/real-time
```

...which returns several entries like this:

```
-rw-r--r-- 3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa
-rw-r--r-- 3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_ab
```
How do I find which of these contains the string `bcd4bc3e1380a56108f486a4fffbc8dc`? Once I know, I can edit them manually.
Using `hadoop fs -cat` (or the more generic `hadoop fs -text`) might be feasible if you just have two 1 GB files. For 100 files, though, I would use the streaming API, because it can be used for ad-hoc queries without resorting to a full-fledged MapReduce job. E.g., in your case create a script `get_filename_for_pattern.sh`:
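A minimal sketch of such a script, assuming the streaming framework exposes the current input path in the `mapreduce_map_input_file` environment variable (older releases used `map_input_file`):

```bash
#!/bin/bash
# Print the name of the current input file if it contains the pattern ($1).
grep -q "$1" && echo "$mapreduce_map_input_file"
# Drain the rest of stdin; grep -q stops at the first match, and leaving
# input unread can make the streaming job fail (see the note below).
cat > /dev/null
```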
Note that you have to read the whole input in order to avoid getting `java.io.IOException: Stream closed` exceptions. Then issue the commands:
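A sketch of the invocation (the output directory is illustrative; `stream.non.zero.exit.is.failure=false` keeps a non-zero mapper exit status from failing the job):

```bash
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -Dstream.non.zero.exit.is.failure=false \
  -files get_filename_for_pattern.sh \
  -numReduceTasks 1 \
  -mapper "get_filename_for_pattern.sh bcd4bc3e1380a56108f486a4fffbc8dc" \
  -reducer "uniq" \
  -input "/apps/mdhi-technology/b_dps/real-time/*" \
  -output /tmp/files_with_pattern

hadoop fs -cat /tmp/files_with_pattern/*
```

The `uniq` reducer collapses the duplicate file names that the mappers emit (one per matching input split), which is why a reduce phase is worth keeping here.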
In newer distributions, `mapred streaming` should work instead of `hadoop jar $HADOOP_HOME/hadoop-streaming.jar`. In the latter case you have to set your `$HADOOP_HOME` correctly in order to find the jar (or provide the full path directly).

For simpler queries you don't even need a script; you can just provide the command to the `-mapper` parameter directly, as in the sketch below. But for anything slightly complex it's preferable to use a script, because getting the escaping right can be a chore.
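For example, a hypothetical inline mapper that prints the matching lines themselves rather than the file names:

```bash
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -Dstream.non.zero.exit.is.failure=false \
  -numReduceTasks 0 \
  -mapper "grep bcd4bc3e1380a56108f486a4fffbc8dc" \
  -input "/apps/mdhi-technology/b_dps/real-time/*" \
  -output /tmp/matching_lines
```

(`grep` exits non-zero when a split contains no match, so the `stream.non.zero.exit.is.failure=false` setting matters here too.)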
If you don't need a reduce phase, provide the symbolic `NONE` parameter to the respective `-reducer` option (or just use `-numReduceTasks 0`). But in your case it's useful to have a reduce phase in order to have the output consolidated into a single file.

You are looking to apply the `grep` command to an HDFS folder.
Here `cat` goes through all the files in the folder, and `grep -c` reports the number of matching lines.
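A sketch, using the folder and string from the question:

```bash
hdfs dfs -cat /apps/mdhi-technology/b_dps/real-time/* | grep -c bcd4bc3e1380a56108f486a4fffbc8dc
```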
This is a Hadoop "filesystem", not a POSIX one, so try this:
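A sketch of a serial loop, assuming the file path is the eighth field of `hadoop fs -ls` output:

```bash
hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk '{print $8}' | \
while read f
do
  # Stream each file out of HDFS and print its name on a match.
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done
```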
This should work, but it is serial and so may be slow. If your cluster can take the heat, we can parallelize:
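A parallel version using GNU `xargs` (the `^` is just an arbitrary placeholder token for `-I`):

```bash
hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk '{print $8}' | \
  xargs -n 1 -I ^ -P 10 bash -c \
  "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"
```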
Notice the `-P 10` option to `xargs`: this is how many files we will download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whichever is relevant in your configuration.

EDIT: Given that you're on SunOS (which is slightly brain-dead), try this:
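A sketch that avoids the GNU-specific `xargs` flags and `grep -q` (Solaris `grep` historically lacks `-q`, hence the redirect to `/dev/null`):

```bash
hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk '{print $8}' | \
while read f
do
  hadoop fs -cat "$f" | grep bcd4bc3e1380a56108f486a4fffbc8dc > /dev/null && echo "$f"
done
```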