I use split function to create an array in Hive, how can I get the first n elements from the array, and I want to go through the sub-array
code example
select col1 from table
where split(col2, ',')[0:5]
'[0:5]'looks likes python style, but it doesn't work here.
This is a tricky one.
First grab the brickhouse jar from here
Then add it to Hive : add jar /path/to/jars/brickhouse-0.7.0-SNAPSHOT.jar;
Now create the two functions we will be usings :
CREATE TEMPORARY FUNCTION array_index AS 'brickhouse.udf.collect.ArrayIndexUDF';
CREATE TEMPORARY FUNCTION numeric_range AS 'brickhouse.udf.collect.NumericRange';
The query will be :
select a,
n as array_index,
array_index(split(a,','),n) as value_from_Array
from ( select "abc#1,def#2,hij#3" a from dual union all
select "abc#1,def#2,hij#3,zzz#4" a from dual) t1
lateral view numeric_range( length(a)-length(regexp_replace(a,',',''))+1 ) n1 as n
Explained :
select "abc#1,def#2,hij#3" a from dual union all
select "abc#1,def#2,hij#3,zzz#4" a from dual
Is just selecting some test data, in your case replace this with your table name.
lateral view numeric_range( length(a)-length(regexp_replace(a,',',''))+1 ) n1 as n
numeric_range is a UDTF that returns a table for a given range, in this case, i asked for a range between 0 (default) and the number of elements in string (calculated as the number of commas + 1)
This way, each row will be multiplied by the number of elements in the given column.
array_index(split(a,','),n)
This is exactly like using split(a,',')[n]
but hive doesn't support it.
So we get the n-th element for each duplicated row of the initial string resulting in :
abc#1,def#2,hij#3,zzz#4 0 abc#1
abc#1,def#2,hij#3,zzz#4 1 def#2
abc#1,def#2,hij#3,zzz#4 2 hij#3
abc#1,def#2,hij#3,zzz#4 3 zzz#4
abc#1,def#2,hij#3 0 abc#1
abc#1,def#2,hij#3 1 def#2
abc#1,def#2,hij#3 2 hij#3
If you really want a specific number of elements (say 5) then just use :
lateral view numeric_range(5 ) n1 as n
This is a much simpler way of doing it. There is a UDF here called TruncateArrayUDF.java
that can do what you are asking. Just clone the repo from the main page and build the jar with Maven.
Example Data:
| col1 |
----------------------
1,2,3,4,5,6,7
11,12,13,14,15,16,17
Query:
add jar /complete/path/to/jar/brickhouse-0.7.0-SNAPSHOT.jar;
create temporary function trunc as 'brickhouse.udf.collect.TruncateArrayUDF';
select pos
,newcol
from (
select trunc(split(col1, '\\,'), 5) as p
from table
) x
lateral view posexplode(p) explodetable as pos, newcol
Output:
pos | newcol |
-------------------
0 1
1 2
2 3
3 4
4 5
0 11
1 12
2 13
3 14
4 15