Hive: split string using regex

2019-08-08 21:16发布

问题:

I have a string of words that are :: delimited. How can I use the Hive UDF regexp_extract() to extract words from the string?

回答1:

regexp_extract('2foa1fa::12hjk','^(\\w.*)\\:{2}(\\w.*)$',1) as word1

OUTPUT: 2foa1fa

regexp_extract('2foa1fa::12hjk','^(\\w.*)\\:{2}(\\w.*)$',2) as word2

OUTPUT: 12hjk

  • ^ anchors to the beginning of the string
  • The \\w looks for a word character and .* means any number of times
  • The \\:{2} looks for two : in a row (this is your :: delimiter)
  • $ anchors the string to the end of the string
  • specifying the third parameter in regexp_extract extracts the indexed (pattern)

Now just stick your column name in the place of the string literal and you should be good to go.

You can also use a split function creating an array and then query by the array location as well. Which will look something like this:

select my_array[2] from 
    (select split('2foa1fa::12hjk','\\::') as my_array from my_table) b;

OUTPUT: 12hjk