How is the object getting generated in the code below?

Posted 2020-05-03 11:34

Question:

I'm trying to understand some Java code. (I have only basic knowledge of Java.)

Here it is:

WordCountMapper Class

package com.company;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        for (String word : line.split(" ")) {

            if (word.length() > 0) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
}

Mapper Class

package org.apache.hadoop.mapreduce;

import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

@InterfaceAudience.Public
@InterfaceStability.Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public Mapper() {
    }

    protected void setup(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context)
            throws IOException, InterruptedException {
    }

    protected void map(KEYIN key, VALUEIN value, Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }

    protected void cleanup(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context)
            throws IOException, InterruptedException {
    }

    public void run(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
    }

    public abstract class Context implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        public Context() {
        }
    }
}

Main method class

package com.company;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Invalid Command");
            System.err.println("Usage: WordCount <input path> <output path>");
            System.exit(0);
        }
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}

My doubt is: in the WordCount class, how does the Text value come into existence? I mean, it is an object, but where does it get created? There is no sign in the main method class of any instantiation of the Text class.

Also, what does this mean? I have never before seen a class declared in the format below:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
{

Any suggestions?

Answer 1:

The code you have pasted is meant to run using the Hadoop MapReduce framework.

Basically you have three classes here:

  • The WordCountMapper, which splits each input line into words and writes each word to the Hadoop MapReduce context
  • The Mapper class, which is part of the Hadoop MapReduce library
  • The WordCount driver, which submits the job to the Hadoop cluster

Actually, I would have expected a WordCountReducer class in your question as well, but it does not seem to be there.

Anyway: the text "comes into existence" when you copy it as a file to your Hadoop cluster; it must be on HDFS (the Hadoop Distributed File System) before you run the job.

This line of code refers to the HDFS input path:

FileInputFormat.addInputPath(job, new Path(args[0]));
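
For completeness, the input file has to be uploaded to HDFS before the job is submitted. Below is a minimal sketch of doing that programmatically with the org.apache.hadoop.fs.FileSystem API; the local and HDFS paths are made up purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadInput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Handle to the file system configured in core-site.xml (HDFS on a cluster)
        FileSystem fs = FileSystem.get(conf);
        // Copy a local text file into HDFS; this HDFS path is what you would
        // then pass to the WordCount driver as args[0]
        fs.copyFromLocalFile(new Path("/tmp/words.txt"),
                new Path("/user/hadoop/input/words.txt"));
    }
}

(In practice this is usually done from the command line instead, for example with hdfs dfs -put.)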

And regarding the question about the code:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

These are generic type parameters (see the Java generics tutorial in the official documentation), which have to be specified each time you subclass Mapper.

Your WordCountMapper actually subclasses this Mapper class and specifies the four types:

public class WordCountMapper extends Mapper<LongWritable,Text,Text,IntWritable>

These are the correspondences:

KEYIN    = LongWritable
VALUEIN  = Text
KEYOUT   = Text
VALUEOUT = IntWritable
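
If the generics syntax itself is the confusing part, the same pattern can be shown with a much smaller, self-contained example; the class names below are made up purely for illustration. A generic base class declares placeholder types, and a subclass pins them to concrete classes, exactly the way WordCountMapper pins the four type parameters of Mapper:

// Generic base class: KEY and VALUE are placeholders, not real classes
class KeyValuePrinter<KEY, VALUE> {
    public void print(KEY key, VALUE value) {
        System.out.println(key + " -> " + value);
    }
}

// Subclass that fixes the placeholders to concrete types, just as
// WordCountMapper fixes LongWritable, Text, Text and IntWritable
class WordCountPrinter extends KeyValuePrinter<String, Integer> {
    @Override
    public void print(String key, Integer value) {
        System.out.println("word '" + key + "' occurred " + value + " time(s)");
    }
}

public class GenericsDemo {
    public static void main(String[] args) {
        new WordCountPrinter().print("hadoop", 1);
    }
}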


Answer 2:

The Hadoop framework creates the necessary objects for you.

You can optionally set an InputFormat, and its key/value types need to match the input types of the class given to setMapperClass (the KEYIN and VALUEIN parameters). Similarly, an output format can be set, and a Reducer has its own input and output types.

The default is TextInputFormat, which produces (LongWritable, Text) key/value pairs. For each InputSplit, a RecordReader is responsible for reading the bytes off the FileSystem and creating the Writable objects that are passed to the Mapper.
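
As a sketch of where this fits in the driver (assuming the Hadoop 2.x Job.getInstance factory; TextInputFormat is already the default, so setting it explicitly only documents the choice):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        // TextInputFormat hands each mapper (LongWritable offset, Text line) pairs,
        // which is exactly the KEYIN/VALUEIN pair that WordCountMapper declares
        job.setInputFormatClass(TextInputFormat.class);
    }
}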

It is worth mentioning that nothing is actually created until you start the job, for example with:

System.exit(job.waitForCompletion(true) ? 0 : 1);
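
waitForCompletion(true) submits the job, prints progress to the console, and blocks until it finishes; only at that point does the framework read the input splits and start creating the LongWritable/Text objects that are handed to WordCountMapper.map(). The driver in the question configures the job but never submits it, so as written nothing would actually run.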


Tags: java hadoop