Wordcount: listing words common to multiple files

Posted 2019-08-27 02:47

I have managed to run the Hadoop wordcount example in non-distributed mode. I get the output in a file named "part-00000", and I can see that it lists all words of all input files combined.

After tracing the wordcount code I can see that it takes lines and splits the words based on spaces.

I am trying to think of a way to list only the words that occur in more than one file, along with their occurrence counts. Can this be achieved in Map/Reduce?

Added: are these changes appropriate?

    // Changes in the type parameters here
    public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

      // These are the original lines; I am not using them but left them here...
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // My changes are here too
      private Text outvalue = new Text();
      FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
      private String filename = fileSplit.getPath().getName();

      public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());

          // And here
          outvalue.set(filename);
          output.collect(word, outvalue);
        }
      }
    }

1 Answer
放我归山
#2 · 2019-08-27 03:21

You could amend the mapper to output the word as the key and a Text value holding the name of the file the word came from. Then in your reducer, you just need to dedup the file names and output only those entries where the word appears in more than one file.
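To make the data flow concrete, here is a minimal plain-Java sketch of that map, group, and reduce sequence. It deliberately uses no Hadoop types so it can run on its own; the class and method names (CommonWordsSketch, mapAndGroup, reduce) are hypothetical and only illustrate the idea, they are not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java sketch of the approach (no Hadoop types).
// All class and method names here are hypothetical.
class CommonWordsSketch {

    // "Map" step: emit (word, filename) pairs, grouped by word the way
    // the shuffle phase would group them before the reducer runs.
    static Map<String, List<String>> mapAndGroup(Map<String, String> fileContents) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> file : fileContents.entrySet()) {
            for (String word : file.getValue().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // leading whitespace yields an empty first token
                }
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(file.getKey());
            }
        }
        return grouped;
    }

    // "Reduce" step: dedup the file names for each word and keep only
    // the words that were seen in more than one file.
    static Map<String, Set<String>> reduce(Map<String, List<String>> grouped) {
        Map<String, Set<String>> common = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            Set<String> files = new TreeSet<>(entry.getValue());
            if (files.size() > 1) {
                common.put(entry.getKey(), files);
            }
        }
        return common;
    }
}
```

With a.txt containing "hello world hello" and b.txt containing "hello hadoop", only "hello" survives the reduce step, mapped to both file names. In this sketch the length of each grouped list is the word's total occurrence count, should that be needed as well.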

How you get the name of the file being processed depends on whether you're using the old API (the mapred package) or the new one (the mapreduce package). With the new API you can obtain the mapper's input split from the Context object using the getInputSplit method, then cast the InputSplit to a FileSplit (assuming you are using TextInputFormat). I've never tried it with the old API, but apparently you can read a configuration property called map.input.file.

This would also be a good place to introduce a Combiner, to dedup repeated occurrences of a word from the same mapper.
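As a rough illustration of what such a Combiner would do, here is a plain-Java sketch of the dedup step applied to the values collected for one word from one mapper. The names CombinerSketch and combine are hypothetical; a real Combiner would implement Hadoop's Reducer interface with Text keys and values.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Plain-Java sketch of the combiner's dedup step (no Hadoop types).
// Class and method names are hypothetical.
class CombinerSketch {

    // Collapses repeated file names emitted for a single word,
    // keeping the order in which each name first appeared.
    static List<String> combine(List<String> fileNamesForWord) {
        return new ArrayList<>(new LinkedHashSet<>(fileNamesForWord));
    }
}
```

Note that deduplicating here discards how many times the word occurred in each file. If the occurrence counts are also wanted, the value would have to carry a count alongside the file name.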

Update

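In response to your problem: you're trying to use an instance variable called reporter, which doesn't exist in the class scope of the mapper. The Reporter is only available as a parameter of the map method, so look up the file name there instead. Amend as follows: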

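import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Nested inside the job class, as in the original wordcount example.
// Note: Text keys assume an input format such as KeyValueTextInputFormat;
// the default TextInputFormat supplies LongWritable keys.
public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
  // These are the original lines; "one" is no longer used but is left here...
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // My changes are here too
  private Text outvalue = new Text();
  private String filename = null;

  public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Resolve the file name lazily: the Reporter only exists inside map()
    if (filename == null) {
      filename = ((FileSplit) reporter.getInputSplit()).getPath().getName();
    }

    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());

      // Emit (word, filename) instead of (word, 1)
      outvalue.set(filename);
      output.collect(word, outvalue);
    }
  }
}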

