Sorting by value in Hadoop from a file

Posted 2019-06-25 12:37

Question:

I have a file in which every line contains a string, then a space, and then a number.

Example:

Line1: Word 2
Line2: Word1 8
Line3: Word2 1

I need to sort the lines by number in descending order and then write the result to a file, assigning a rank to each number. So my output should be a file in the following format:

Line1: Word1 8 1
Line2: Word 2 2
Line3: Word2 1 3

Does anyone have an idea how I can do this in Hadoop? I am using Java with Hadoop.

Answer 1:

You could organize your map/reduce computation like this:

Map input: default

Map output: "key: number, value: word"

(sorting phase by key)

Here you will need to override the default sorter to sort in decreasing order.

Reduce: use a single reducer

Reduce input: "key: number, value: word"

Reduce output: "key: word, value: (number, rank)"

Keep a running counter (with a single reducer, a field on the reducer is enough). For each key-value pair, increment the counter and emit its current value as the rank.
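To make the data flow of this plan concrete, here is a minimal, framework-free sketch in plain Java. It uses no Hadoop classes: an in-memory list stands in for the shuffle, and the class and method names (RankSketch, map, rank) are made up for illustration. Treat it as a picture of the logic, not as a job you can submit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, in-memory stand-in for the map -> sort -> single-reducer plan.
class RankSketch {

    // One "map output" record: key = number, value = word.
    static final class Pair {
        final int number;
        final String word;

        Pair(int number, String word) {
            this.number = number;
            this.word = word;
        }
    }

    // "Map": split a "word number" line and swap it into (number, word).
    static Pair map(String line) {
        String[] parts = line.trim().split("\\s+");
        return new Pair(Integer.parseInt(parts[1]), parts[0]);
    }

    // "Sort + reduce": order the keys from largest to smallest, then
    // append a rank taken from a running counter.
    static List<String> rank(List<String> lines) {
        List<Pair> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        // Descending order on the key; the sort is stable, so equal
        // numbers keep their input order.
        pairs.sort((a, b) -> Integer.compare(b.number, a.number));

        List<String> out = new ArrayList<>();
        int counter = 0; // the counter kept by the one reducer
        for (Pair p : pairs) {
            counter++;
            out.add(p.word + " " + p.number + " " + counter);
        }
        return out;
    }
}
```

Calling RankSketch.rank on the three example lines from the question yields the lines "Word1 8 1", "Word 2 2" and "Word2 1 3". Note that equal numbers get distinct consecutive ranks here, since the counter is incremented once per record.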

Edit: Here is a code snippet of a custom descending comparator:

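import java.nio.ByteBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

// Sorts IntWritable keys in descending order by comparing their serialized bytes.
public static class IntComparator extends WritableComparator {

    public IntComparator() {
        super(IntWritable.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1,
            byte[] b2, int s2, int l2) {

        // An IntWritable is serialized as 4 big-endian bytes.
        int v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, l2).getInt();

        // Swap the operands to reverse the natural (ascending) order.
        return Integer.compare(v2, v1);
    }
}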

Don't forget to actually set it as the comparator for your job:

job.setSortComparatorClass(IntComparator.class);
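To see why the byte-level compare above gives a descending order, the following plain-Java sketch reproduces the same decoding outside Hadoop. It relies on IntWritable writing its value with DataOutput.writeInt, i.e. as 4 big-endian bytes, which is the same layout ByteBuffer uses by default. The class and method names (RawIntCompareSketch, encode, compareDescending) are hypothetical and exist only for this illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-alone demo of the raw comparison; no Hadoop classes.
class RawIntCompareSketch {

    // Encode an int as 4 big-endian bytes, the same layout that
    // DataOutput.writeInt (and therefore IntWritable) produces.
    static byte[] encode(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Same decoding as the comparator above: read one int from each
    // (buffer, start, length) slice and compare with operands swapped.
    static int compareDescending(byte[] b1, int s1, int l1,
                                 byte[] b2, int s2, int l2) {
        int v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(v2, v1);
    }
}
```

With these helpers, compareDescending(encode(8), 0, 4, encode(2), 0, 4) is negative, meaning that 8 sorts before 2. The start offsets matter because Hadoop passes slices of a larger buffer rather than one array per key.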


Answer 2:

Hadoop Streaming - Hadoop 1.0.x

According to the linked reference, after the

bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.*.jar

part of the command,

  1. you add a comparator

    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator

  2. you specify the kind of sorting you want

    -D mapred.text.key.comparator.options=-[options]

where [options] are similar to the options of Unix sort. Here are some examples:

Reverse order

-D mapred.text.key.comparator.options=-r

Sort on numeric values

-D mapred.text.key.comparator.options=-n

Sort on the value or on any other field

-D mapred.text.key.comparator.options=-kx,y

The -k flag specifies the sort key, and the x,y parameters define it: as with Unix sort, x is the field where the key starts and y is the field where it ends. So, if a line has more than one token, you can choose which token, or which combination of tokens, is the sort key. See the references for more details and examples.
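As a rough picture of the ordering that combined options such as -k2,2nr request (key = field 2, numeric, reversed), here is a small plain-Java emulation. It is not Hadoop's KeyFieldBasedComparator and makes simplifying assumptions: fields are separated by whitespace, as in the question's file, and the chosen field always parses as a number. The names KeyFieldSortSketch, sortByFieldNumericReverse and fieldValue are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical emulation of "sort on field N, numerically, in reverse".
class KeyFieldSortSketch {

    // Returns a sorted copy of the lines; 'field' is 1-based, like sort -k.
    static List<String> sortByFieldNumericReverse(List<String> lines, int field) {
        List<String> sorted = new ArrayList<>(lines);
        // Operands are swapped so that larger numbers come first.
        sorted.sort((a, b) -> Long.compare(fieldValue(b, field), fieldValue(a, field)));
        return sorted;
    }

    // Extracts the given whitespace-separated field and parses it as a number.
    static long fieldValue(String line, int field) {
        return Long.parseLong(line.trim().split("\\s+")[field - 1]);
    }
}
```

For the question's input, sortByFieldNumericReverse(lines, 2) puts "Word1 8" first, then "Word 2", then "Word2 1".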



Answer 3:

I worked out a solution to this problem. It was actually simple.

For sorting by value you need to use

setOutputValueGroupingComparator(Class)

For sorting in decreasing order you need to use

setSortComparatorClass(LongWritable.DecreasingComparator.class);

For ranking you need to use the Counter class, together with the getCounter and increment functions.