I have managed to run the Hadoop wordcount example in non-distributed mode. The output goes to a file named "part-00000", and I can see that it lists all the words of all input files combined.
After tracing the wordcount code, I can see that it takes lines and splits them into words on whitespace.
I am trying to think of a way to list only the words that occur in multiple files, along with their occurrences. Can this be achieved in Map/Reduce?
-Added- Are these changes appropriate?
//changes in the parameters here
public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
    // These are the original line; I am not using them but left them here...
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    //My changes are here too
    private Text outvalue=new Text();
    FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
    private String filename = fileSplit.getPath().getName();

    public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            // And here
            outvalue.set(filename);
            output.collect(word, outvalue);
        }
    }
}
You could amend the mapper to output the word as the key and a Text as the value, holding the name of the file the word came from. Then in your reducer, you just need to dedup the file names and output only those entries where the word appears in more than one file.
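To make the idea concrete, here is a minimal plain-Java sketch of that map and reduce logic, with no Hadoop dependency. The class name SharedWords and the run method (which stands in for the framework's grouping of values by key) are made up for illustration; only the map and reduce methods mirror what the real mapper and reducer would do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

class SharedWords {

    /** "Map" step: emit one (word, fileName) pair per token in the line. */
    public static List<String[]> map(String fileName, String line) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new String[] {tokenizer.nextToken(), fileName});
        }
        return pairs;
    }

    /** "Reduce" step: dedup the file names of one word; null means "emit nothing". */
    public static Set<String> reduce(Iterable<String> fileNames) {
        Set<String> distinct = new TreeSet<>();
        for (String name : fileNames) {
            distinct.add(name);
        }
        return distinct.size() > 1 ? distinct : null;
    }

    /** Hypothetical stand-in for the framework: group pairs by word, then reduce each group. */
    public static Map<String, Set<String>> run(Map<String, List<String>> linesByFile) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : linesByFile.entrySet()) {
            for (String line : file.getValue()) {
                for (String[] pair : map(file.getKey(), line)) {
                    grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
                }
            }
        }
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            Set<String> sharedBy = reduce(entry.getValue());
            if (sharedBy != null) {
                result.put(entry.getKey(), sharedBy);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> linesByFile = new TreeMap<>();
        linesByFile.put("a.txt", Arrays.asList("hello world hello"));
        linesByFile.put("b.txt", Arrays.asList("hello hadoop"));
        System.out.println(run(linesByFile)); // prints {hello=[a.txt, b.txt]}
    }
}
```

Only "hello" survives, because it is the only word seen in both files; "world" and "hadoop" each map to a single file name and are dropped by the reducer.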
How you get the name of the file being processed depends on whether you're using the new API or the old one (the mapreduce or mapred package names, respectively). I know that for the new API you can extract the mapper's input split from the Context object using the getInputSplit method (then probably cast the InputSplit to a FileSplit, assuming you are using TextInputFormat). For the old API, I've never tried it, but apparently you can use a configuration property called map.input.file.
This would also be a good place to introduce a Combiner, to dedup multiple occurrences of the same word from the same mapper.
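As a rough illustration of what such a Combiner buys you, the plain-Java sketch below collapses the repeated file names one mapper emits for a single word before they are handed on. CombinerSketch, combine and distinctFiles are hypothetical names used only here, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class CombinerSketch {

    /** Combiner stand-in: collapse repeated file names emitted for one word by one mapper. */
    public static List<String> combine(List<String> fileNames) {
        return new ArrayList<>(new LinkedHashSet<>(fileNames));
    }

    /** Reducer stand-in: number of distinct files the word appeared in. */
    public static int distinctFiles(List<String> fileNames) {
        return new LinkedHashSet<>(fileNames).size();
    }

    public static void main(String[] args) {
        // One mapper saw "hello" three times in a.txt, another saw it once in b.txt.
        List<String> fromMapper1 = Arrays.asList("a.txt", "a.txt", "a.txt");
        List<String> fromMapper2 = Arrays.asList("b.txt");

        List<String> shuffled = new ArrayList<>(combine(fromMapper1));
        shuffled.addAll(combine(fromMapper2));

        System.out.println(shuffled);                // prints [a.txt, b.txt]
        System.out.println(distinctFiles(shuffled)); // prints 2
    }
}
```

Because the reducer dedups anyway, the result is the same whether or not the combine step runs, which matters since Hadoop may apply a Combiner zero or more times. The gain is simply that fewer pairs have to be shuffled.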
Update
So in response to your problem: you're trying to use a variable called reporter at class level, but it doesn't exist in the class scope of the mapper. It is only passed in as a parameter to the map method, so the file name lookup has to move inside map.
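A minimal sketch of that amendment against the old (org.apache.hadoop.mapred) API, which needs the Hadoop jars on the classpath to compile. Two things here are assumptions rather than part of the question: the class name FileNameMapper is made up (it could equally stay a nested static class called Map), and the input key type is changed to LongWritable on the assumption that the job uses the default TextInputFormat, which supplies byte offsets as keys.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical class name; assumes TextInputFormat (LongWritable offset, Text line).
public class FileNameMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Text word = new Text();
    private final Text outvalue = new Text();

    // Resolved lazily: no Reporter is available when fields are initialised.
    private String filename;

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        if (filename == null) {
            // Assumes a file-based input format, so the split really is a FileSplit.
            FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
            filename = fileSplit.getPath().getName();
        }

        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            outvalue.set(filename);
            // Emit (word, file name) instead of (word, 1).
            output.collect(word, outvalue);
        }
    }
}
```

Since the mapper now emits Text values rather than IntWritable, the job setup presumably also needs its output value class changed to Text to match.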