Hadoop/Hive Collect_list without repeating items

2019-03-02 07:41发布

Based on the post, Hive 0.12 - Collect_list, I am trying to locate Java code to implement a UDAF that will accomplish this or similar functionality but without a repeating sequence.

For instance, collect_all() returns a sequence A, A, A, B, B, A, C, C I would like to have the sequence A, B, A, C returned. Sequentially repeated items would be removed.

Does anyone know of a function in Hive 0.12 that will accomplish or has written their own UDAF?

As always, thanks for the help.

2条回答
狗以群分
2楼-- · 2019-03-02 08:19

if you have soemthing like this

index  value
1       A
2       A
3       A
4       B
5       B
6       A
7       c
8       c

Where index is some rank order value such as an index directly or something like a date. I assume order matters in your situation.

Then Query:

select collect_all(value)
from
  (select index, value 
   from table) a
   left outer join
  (select index, 
     last_value(value) over (order by index row between current row and 1 following) as nextvalue 
   from table) b
  on a.index=b.index
  where value <> nextvalue
;

The problem here is that you won't get the last value of C because there is no next value, so add or nextvalue is null and you should have the results.

select collect_all(value)
from
  (select index, value 
   from table) a
   left outer join
  (select index, 
     last_value(value) over (order by index row between current row and 1 following) as nextvalue 
   from table) b
  on a.index=b.index
  where (value <> nextvalue) or (nextvalue is null)
;

This should yield [ "A", "B", "A", "C"]

查看更多
看我几分像从前
3楼-- · 2019-03-02 08:24

I ran into a similar problem awhile back. I didn't want to have to write a full-on UDAF so I just did a combo with brickhouse collect and my own UDF. Say you have this data

id  value
1   A
1   A
1   A
1   B
1   B
1   A
1   C
1   C
1   D
2   D
2   D
2   D
2   D
2   F
2   F
2   F
2   A
2   W
2   A

my UDF was

package com.something;

import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class RemoveSequentialDuplicates extends UDF {
    public ArrayList<Text> evaluate(ArrayList<Text> arr) {
        ArrayList<Text> newList = new ArrayList<Text>();
        newList.add(arr.get(0));
        for (int i=1; i<arr.size(); i++) {

            String front = arr.get(i).toString();
            String back = arr.get(i-1).toString();

            if (!back.equals(front)) {
                newList.add(arr.get(i));
            }
        }
        return newList;
    }
}

and then my query was

add jar /path/to/jar/brickhouse-0.7.1.jar;
add jar /path/to/other/jar/duplicates.jar;

create temporary function remove_seq_dups as 'com.something.RemoveSequentialDuplicates';
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select id
  , remove_seq_dups(value_array) no_dups
from (
  select id
    , collect(value) value_array
  from db.table
  group by id ) x

output

1   ["A","B","A","C","D"]
2   ["D","F","A","W","A"]

As an aside, the built-in collect_list will not necessary keep the elements of the list in the order they were grouped in; brickhouse collect will. hope this helps.

查看更多
登录 后发表回答