hadoop inverted-index without recurrence of file n

What I have in the output is:

word    file
-----   ------
wordx   Doc2, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1

What I want is:

word    file
-----   ------
wordx   Doc2, Doc1

public static class LineIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text val,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        location.set(fileName);

        String line = val.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, location);
        }
    }
}

public static class LineIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        boolean first = true;
        StringBuilder toReturn = new StringBuilder();
        while (values.hasNext()) {
            if (!first) {
                toReturn.append(", ");
            }
            first = false;
            toReturn.append(values.next().toString());
        }

        output.collect(key, new Text(toReturn.toString()));
    }
}

For the best performance, where should I skip the recurring file name: in the map, in the reduce, or in both? PS: I am a beginner at writing MR tasks and am also trying to work out the programming logic behind my question.

Tags: hadoop inverted-index

2 answers

霸刀☆藐视天下

Reply #2 · 2019-04-16 16:37

You can improve performance by doing local map aggregation and introducing a combiner. Basically, you want to reduce the amount of data being transmitted between your mappers and reducers.

Local map aggregation is a concept whereby you maintain an LRU-like map (or set) of output pairs. In your case, that is a set of words for the current mapper document (assuming you have a single document per map). This way you can look up the word in the set and output a K,V pair only if the set doesn't already contain that word (meaning you haven't already output an entry for it). If the set doesn't contain the word, output the (word, docid) pair and add the word to the set.

If the set gets too big (say 5000 or 10000 entries), clear it out and start over. This way you'll see the number of values output from the mapper drop dramatically (provided your value domain or set of values is small; words are a good example of this).
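As a rough sketch of that bounded-set idea, the snippet below strips out the Hadoop types so it can run on its own. The class name LocalDedupSketch, the mapLine method and the maxEntries cap are illustrative assumptions, not part of the job in the question. It also assumes one document per mapper, so the set is keyed on the word alone. In a real mapper the same check would presumably sit just before output.collect(word, location).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

/** Hypothetical stand-in for a mapper's per-document state; not a Hadoop class. */
final class LocalDedupSketch {
    private final int maxEntries;                     // assumed cap, e.g. 5000 or 10000
    private final Set<String> seen = new HashSet<>(); // words already emitted

    LocalDedupSketch(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    /** Returns the (word, fileName) pairs a mapper would emit for this line. */
    List<String[]> mapLine(String fileName, String line) {
        List<String[]> emitted = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            String word = itr.nextToken();
            // Keep memory bounded: once the cap is reached, forget everything.
            // Some duplicates may be emitted again afterwards.
            if (seen.size() >= maxEntries) {
                seen.clear();
            }
            // add() returns true only the first time the word is seen.
            if (seen.add(word)) {
                emitted.add(new String[] {word, fileName});
            }
        }
        return emitted;
    }
}
```

Because the set is cleared when it fills up (and because one file can be split across several mappers), this only cuts down the traffic. The reducer still has to do the final de-duplication.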

You can also apply your reducer logic in the combiner phase.
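One caveat if the reducer from the question is reused as the combiner unchanged: it joins the file names into a single string, so the real reducer would then receive already-joined strings as its values. A combiner that emits each distinct file name once, as a separate value, avoids that. Here is a minimal sketch of that logic in plain Java. DistinctCombinerSketch and combine are made-up names, and in the old mapred API the real class would be registered with JobConf.setCombinerClass.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical combiner logic without the Hadoop types. */
final class DistinctCombinerSketch {

    /**
     * Takes all file names seen for one word on one mapper and returns
     * each distinct name once, in first-seen order. Each returned element
     * stands for one separate output.collect(key, value) call.
     */
    static List<String> combine(Iterator<String> values) {
        Set<String> distinct = new LinkedHashSet<>();
        while (values.hasNext()) {
            distinct.add(values.next());
        }
        return new ArrayList<>(distinct);
    }
}
```

Hadoop may run a combiner zero, one or several times, so the reducer should still de-duplicate on its own.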

One final word of warning: be very careful about adding the Key / Value objects into sets (like in Matt D's answer). Hadoop re-uses objects under the hood, so don't be surprised if you get unexpected results when you add the references. Always create a copy of the object.
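To see why the copy matters, here is a small plain-Java illustration. Holder is a hypothetical mutable stand-in for Text, and reusingIterator mimics an iterator that hands back the same instance every time. That is an assumption about how the reuse behaves, not Hadoop's actual code.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

final class ReuseSketch {

    /** Hypothetical mutable value holder, playing the role of Text. */
    static final class Holder {
        private String s = "";
        Holder() {}
        Holder(Holder other) { this.s = other.s; }   // copy constructor
        void set(String s) { this.s = s; }
        @Override public boolean equals(Object o) {
            return o instanceof Holder && ((Holder) o).s.equals(s);
        }
        @Override public int hashCode() { return s.hashCode(); }
        @Override public String toString() { return s; }
    }

    /** Iterator that overwrites and returns one shared Holder on every call. */
    static Iterator<Holder> reusingIterator(List<String> data) {
        final Holder shared = new Holder();
        final Iterator<String> it = data.iterator();
        return new Iterator<Holder>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public Holder next() { shared.set(it.next()); return shared; }
        };
    }

    /** Collects the values into a set, with or without copying, and returns their text. */
    static Set<String> collect(Iterator<Holder> values, boolean copy) {
        Set<Holder> set = new HashSet<>();
        while (values.hasNext()) {
            Holder v = values.next();
            set.add(copy ? new Holder(v) : v);
        }
        Set<String> out = new HashSet<>();
        for (Holder h : set) {
            out.add(h.toString());
        }
        return out;
    }
}
```

With the input Doc2, Doc1, Doc1, collecting without a copy leaves only Doc1 visible, because every entry in the set is the same object holding the last value written. Collecting with a copy yields both Doc1 and Doc2.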

There's an article on local map aggregation (for the word count example) that you may find useful:

http://wikidoop.com/wiki/Hadoop/MapReduce/Mapper#Map_Aggregation


倾城　Initia

Reply #3 · 2019-04-16 17:01

You will only be able to remove all duplicates in the Reducer, since that is the only place where every value for a key is seen together. To do so, you can use a Set, which does not allow duplicates.

public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // Text overrides equals() and hashCode(), so a HashSet drops duplicates
    Set<Text> outputValues = new HashSet<Text>();

    while (values.hasNext()) {
        // copy the value: Hadoop may reuse the same Text instance on each call
        Text value = new Text(values.next());

        // the Set silently ignores values it already holds
        outputValues.add(value);
    }

    boolean first = true;
    StringBuilder toReturn = new StringBuilder();
    Iterator<Text> outputIter = outputValues.iterator();
    while (outputIter.hasNext()) {
        if (!first) {
            toReturn.append(", ");
        }
        first = false;
        toReturn.append(outputIter.next().toString());
    }

    output.collect(key, new Text(toReturn.toString()));
}

Edit: Adds copy of value to Set as per Chris' comment.
