I can't understand HiveQL rank(). I've found a couple of implementations of rank UDF's on the WWW, such as Edward's nice example. I can load and access the functions, but I can't get them to do what I want. Here is a detailed example:
Loading the UDF into the CLI process:
$ javac -classpath /home/hadoop/hadoop/hadoop-core-1.0.4.jar:/home/hadoop/hive/lib/hive-exec-0.10.0.jar com/m6d/hiveudf/Rank2.java
$ jar -cvf Rank2.jar com/m6d/hiveudf/Rank2.class
hive> ADD JAR /home/hadoop/MyDemo/Rank2.jar;
hive> CREATE TEMPORARY FUNCTION Rank2 AS 'com.m6d.hiveudf.Rank2';
Create a table:
create table purchases (
SalesRepId String,
PurchaseOrderId INT,
Amount INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
Load data from this CSV:
Jana,1,100
Nadia,2,200
Nadia,3,600
Daniel,4,80
Jana,5,120
William,6,170
Daniel,7,140
With this from the CLI:
LOAD DATA
LOCAL INPATH '/home/hadoop/MyDemo/purchases.csv'
INTO TABLE purchases;
Now I can see my top Sales Reps:
select SalesRepId,sum(amount) as volume
from purchases
group by SalesRepId
ORDER BY volume DESC;
Nadia has sold $800 of stuff, Daniel and Jana have both sold $220, and William has sold $170
SalesRep Amount
-------- ------
Nadia 800
Daniel 220
Jana 220
William 170
Now all I want to do is number them: Nadia is #1, Daniel and Jana are tied for #2, and William is #4 (not #3)
select SalesRepId, V.volume,rank2(V.volume)
from
(select SalesRepId,sum(amount) as volume
from purchases
group by SalesRepId
ORDER BY volume DESC) V;
This is what I get, but NOT what I want:
SalesRep Amount Rank
-------- ------ ----
Nadia 800 1
Daniel 220 1
Jana 220 2
William 170 1
This is what I WANT, but I can't make hive do it for me:
SalesRep Amount Rank
-------- ------ ----
Nadia 800 1
Daniel 220 2
Jana 220 2
William 170 4
Can you help me with the correct HiveQL to rank my Sales Reps?
Thanks to JtheRocker for his response. His change resulted in this list:
SalesRep Amount Rank
-------- ------ ----
William 170 1
Daniel 220 2
Jana 220 2
Nadia 800 3
A slight modification to show Nadia as 4th (not 3rd):
private row_number;
@Override
public Object evaluate(DeferredObject[] currentKey) throws HiveException {
row_number++;
if (!sameAsPreviousKey(currentKey)) {
this.counter = row_number;
copyToPreviousKey(currentKey);
}
return new Long(this.counter);
}
With the Windowing and Analytics functions introduced in Hive 0.11, you can use:
If you have a evaluate function as below, assuming you are using the function directly form the guide you mentioned,
try changing it to the following, so that the counter is not reset when it finds a new volume, rather you don't increment if you find the same volume but increment only when it finds a new volume.
Tell me if this helps.