What I get as output is:
word , file
----- ------
wordx Doc2, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1
What I want is:
word , file
----- ------
wordx Doc2, Doc1
public static class LineIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text val,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // The file this input split belongs to becomes the emitted value.
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        location.set(fileName);

        // Emit one (word, fileName) pair per token, duplicates included.
        String line = val.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, location);
        }
    }
}

public static class LineIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Join every value for this word with ", " (no de-duplication yet).
        boolean first = true;
        StringBuilder toReturn = new StringBuilder();
        while (values.hasNext()) {
            if (!first) {
                toReturn.append(", ");
            }
            first = false;
            toReturn.append(values.next().toString());
        }
        output.collect(key, new Text(toReturn.toString()));
    }
}
For the best performance, where should I skip the recurring file name: in the map, in the reduce, or in both? PS: I am a beginner at writing MR tasks and am also trying to work out the programming logic behind my question.
You can improve performance by doing local map aggregation and introducing a combiner. Basically, you want to reduce the amount of data being transmitted between your mappers and reducers.
Local map aggregation is a technique whereby you maintain an LRU-like map (or set) of output pairs; in your case, a set of the words already emitted for the current mapper's document (assuming you have a single document per map). Look up each word in the set and only output a K,V pair if the set doesn't already contain it (meaning you haven't output an entry for it yet). When the word is not in the set, output the word, docid pair and add the word to the set.
If the set gets too big (say 5000 or 10000 entries), clear it out and start over. This way you'll see the number of values output from the mapper drop dramatically (provided your value domain or set of values is small; words are a good example of this).
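A minimal sketch of the bounded "seen" set described above, written as plain Java so it can be tried without a cluster. The class name `SeenWords`, the method `firstTime`, and the `maxEntries` parameter are made up for illustration and are not part of Hadoop. It assumes one document per mapper, as stated above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not a Hadoop class): remembers which words the
// current mapper has already emitted so that repeats can be skipped.
class SeenWords {
    private final Set<String> seen = new HashSet<>();
    private final int maxEntries;

    SeenWords(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    // Returns true when the caller should emit (word, fileName) now.
    boolean firstTime(String word) {
        if (seen.contains(word)) {
            return false;        // already emitted for this document
        }
        if (seen.size() >= maxEntries) {
            seen.clear();        // bound memory; some duplicates may slip through later
        }
        seen.add(word);          // String is immutable, so no defensive copy is needed
        return true;
    }
}

// Possible use inside map(), assuming a field `seenWords = new SeenWords(10000)`:
//     String token = itr.nextToken();
//     if (seenWords.firstTime(token)) {
//         word.set(token);
//         output.collect(word, location);
//     }
```

Because the set is cleared when it fills up, the mapper can still emit an occasional duplicate, so the reducer should de-duplicate as well.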
You can also apply your reducer logic in the combiner phase.
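One caveat when doing this: a combiner must emit the same key/value types as the mapper, and Hadoop may run it zero, one, or several times. Reusing the joining reducer from the question unchanged would therefore hand strings like "Doc1, Doc1" to the real reducer as single values. A rough sketch of combiner-side logic that only collapses duplicates and leaves the joining to the reducer might look like this (`CombinerLogic` and `distinctValues` are illustrative names, not Hadoop API):

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: the de-duplication step a combiner could apply.
class CombinerLogic {
    // Collapses the file names seen for one word into distinct values,
    // keeping first-seen order. toString() produces an independent String,
    // so a reused value object cannot alter entries already stored.
    static Set<String> distinctValues(Iterator<?> values) {
        Set<String> distinct = new LinkedHashSet<>();
        while (values.hasNext()) {
            distinct.add(values.next().toString());
        }
        return distinct;
    }
}

// Possible use inside a combiner's reduce(): emit each name on its own,
// so the output still looks like mapper output.
//     for (String name : CombinerLogic.distinctValues(values)) {
//         output.collect(key, new Text(name));
//     }
```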
One final word of warning: be very careful about adding the Key / Value objects into sets (like in Matt D's answer). Hadoop re-uses objects under the hood, so don't be surprised if you get unexpected results when you add the references; always create a copy of the object.
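To illustrate the reuse problem without a cluster, here is a small simulation. `MutableValue` is a made-up stand-in for a mutable value object such as Hadoop's `Text`, and a list is used instead of a set only to keep the demonstration simple; the same aliasing applies to any collection that stores references.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a mutable value object that a framework reuses.
class MutableValue {
    private String value = "";
    void set(String v) { value = v; }
    @Override public String toString() { return value; }
}

class ReuseDemo {
    // Stores the reference: every entry is the same object,
    // so all of them end up showing the last value that was set.
    static List<MutableValue> keepReferences(String[] inputs) {
        MutableValue reused = new MutableValue();
        List<MutableValue> kept = new ArrayList<>();
        for (String in : inputs) {
            reused.set(in);          // the same instance is overwritten each time
            kept.add(reused);
        }
        return kept;
    }

    // Stores a copy: each entry keeps the value it had when it was added.
    static List<String> keepCopies(String[] inputs) {
        MutableValue reused = new MutableValue();
        List<String> kept = new ArrayList<>();
        for (String in : inputs) {
            reused.set(in);
            kept.add(reused.toString());   // independent, immutable copy
        }
        return kept;
    }
}
```

With real `Text` values, copying via `toString()` or `new Text(value)` plays the role that `reused.toString()` plays here.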
There's an article on local map aggregation (for the word count example) that you may find useful:
You will only be able to remove duplicates in the Reducer. To do so, you can use a Set, which does not allow duplicates.
Edit: Adds copy of value to Set as per Chris' comment.
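A minimal sketch of the Set-based approach described in this answer, as plain Java rather than the answer's own code. `ReducerLogic` and `joinDistinct` are illustrative names. It assumes the values are converted with `toString()` before being stored, which also gives the copy mentioned in the edit note.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: the core of a de-duplicating reduce step.
class ReducerLogic {
    // Builds "Doc2, Doc1" from values such as Doc2, Doc1, Doc1, Doc1.
    static String joinDistinct(Iterator<?> values) {
        // LinkedHashSet drops duplicates and keeps first-seen order.
        Set<String> distinct = new LinkedHashSet<>();
        while (values.hasNext()) {
            // Store a String copy, never the (possibly reused) value object.
            distinct.add(values.next().toString());
        }
        return String.join(", ", distinct);
    }
}

// Possible use inside reduce():
//     output.collect(key, new Text(ReducerLogic.joinDistinct(values)));
```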