I have a file in which every line contains a string, then a space, then a number.
Example:
Line1: Word 2
Line2: Word1 8
Line3: Word2 1
I need to sort the lines by number in descending order and write the result to a file, assigning a rank to each number. The output file should have the following format:
Line1: Word1 8 1
Line2: Word 2 2
Line3: Word2 1 3
Does anyone have an idea how I can do this in Hadoop?
I am using Java with Hadoop.
You could organize your map/reduce computation like this:
Map input: default
Map output: "key: number, value: word"
(sorting phase by key)
Here you will need to override the default sorter to sort in decreasing order.
Reduce - 1 reducer
Reduce input: "key: number, value: word"
Reduce output: "key: word, value: (number, rank)"
Keep a counter in the reducer; because there is only one reducer, it acts as a global counter. For each key-value pair, increment the counter and emit its value as the rank.
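The data flow above can be tried out without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class name RankSketch and the method rank are hypothetical, a list sort stands in for the shuffle with a descending comparator, and the final loop plays the role of the single reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RankSketch {
    // Takes lines of the form "word number" and returns "word number rank",
    // ordered by number descending. Hypothetical helper, not a Hadoop API.
    static List<String> rank(List<String> lines) {
        // "Map" step: turn each line into a (number, word) pair.
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            pairs.add(new String[] { parts[1], parts[0] });
        }

        // "Sort" step: order the pairs by number, largest first.
        pairs.sort(Comparator
                .comparingInt((String[] p) -> Integer.parseInt(p[0]))
                .reversed());

        // "Reduce" step: a single pass with one counter assigns the ranks.
        List<String> out = new ArrayList<>();
        int counter = 0;
        for (String[] pair : pairs) {
            counter++;
            out.add(pair[1] + " " + pair[0] + " " + counter);
        }
        return out;
    }

    public static void main(String[] args) {
        // Uses the sample input from the question.
        for (String line : rank(Arrays.asList("Word 2", "Word1 8", "Word2 1"))) {
            System.out.println(line);
        }
    }
}
```

In a real job the counter would live in the reducer instance, which is only safe because the job is configured with exactly one reducer.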
Edit: Here is a code snippet of a custom descending sorter:
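import java.nio.ByteBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

// Nested inside your job class; sorts IntWritable keys in descending order.
public static class IntComparator extends WritableComparator {
    public IntComparator() {
        super(IntWritable.class);
    }

    // Compares the serialized keys directly, without deserializing them into objects.
    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        int v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        // Swapping the arguments reverses the natural ascending order.
        return Integer.compare(v2, v1);
    }
}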
Don't forget to actually set it as the comparator for your job:
job.setSortComparatorClass(IntComparator.class);
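To see what the raw-bytes comparison does, here is a small stand-alone sketch that needs no Hadoop classes. It assumes keys are serialized as 4-byte big-endian integers, which is how IntWritable writes its value; the class RawIntCompareSketch and its methods are hypothetical names used only for illustration.

```java
import java.nio.ByteBuffer;

class RawIntCompareSketch {
    // Packs the given ints back to back, 4 big-endian bytes each.
    static byte[] encode(int... values) {
        ByteBuffer buf = ByteBuffer.allocate(4 * values.length);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Same logic as the comparator above: read one int from each buffer
    // at the given offset and compare them in reverse order.
    static int compareDescending(byte[] b1, int s1, int l1,
                                 byte[] b2, int s2, int l2) {
        int v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(v2, v1);
    }

    public static void main(String[] args) {
        byte[] eight = encode(8);
        byte[] two = encode(2);
        // A negative result means the first key sorts before the second.
        System.out.println(compareDescending(eight, 0, 4, two, 0, 4));
    }
}
```

The offset and length arguments matter because Hadoop hands the comparator a shared buffer that holds many serialized records, not one array per key.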
Hadoop Streaming - Hadoop 1.0.x
According to this, after the
bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.*.jar
you add a comparator
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
you specify the kind of sorting you want
-D mapred.text.key.comparator.options=-[options]
where the [options] are similar to those of Unix sort. Here are some examples:
Reverse order
-D mapred.text.key.comparator.options=-r
Sort on numeric values
-D mapred.text.key.comparator.options=-n
Sort on the value or any other field
-D mapred.text.key.comparator.options=-kx,y
With the -k flag you specify the sort key, and the x and y parameters define it. So, if a line has more than one token, you can choose which token, or which combination of tokens, is used as the sort key. See the references for more details and examples.
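As a rough illustration of what these options mean, the sketch below emulates a sort on a single field in plain Java. It only approximates the Unix-sort style semantics and is not the KeyFieldBasedComparator implementation; the class KeyFieldSortSketch, its methods, and the tab separator are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class KeyFieldSortSketch {
    // Returns the 1-based field x of a tab-separated line.
    static String field(String line, int x) {
        return line.split("\t")[x - 1];
    }

    // Sorts by field x only (roughly -kx,x), optionally numeric (n) and reversed (r).
    static List<String> sortByField(List<String> lines, int x,
                                    boolean numeric, boolean reverse) {
        Comparator<String> cmp;
        if (numeric) {
            cmp = Comparator.comparingDouble((String l) -> Double.parseDouble(field(l, x)));
        } else {
            cmp = Comparator.comparing((String l) -> field(l, x));
        }
        if (reverse) {
            cmp = cmp.reversed();
        }
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(cmp);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Word\t2", "Word1\t8", "Word2\t1");
        // Roughly what -k2,2nr asks for: field 2, numeric, reversed.
        for (String line : sortByField(lines, 2, true, true)) {
            System.out.println(line);
        }
    }
}
```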
I worked out a solution to this problem. It was actually simple.
For sorting by value you need to use
setOutputValueGroupingComparator(Class)
For sorting in decreasing order you need to use setSortComparatorClass(LongWritable.DecreasingComparator.class);
For ranking you need to use the Counter class, together with the getCounter and increment functions.