I am having some problems with regexp_extract:
I am querying on a tab-delimited file, the column I'm checking has strings that look like this:
abc.def.ghi
Now, if I do:
select distinct regexp_extract(name, '[^.]+', 0) from dummy;
MR job runs, it works, and I get "abc" from index 0.
But now, if I want to get "def" from index 1:
select distinct regexp_extract(name, '[^.]+', 1) from dummy;
Hive fails with:
2011-12-13 23:17:08,132 Stage-1 map = 0%, reduce = 0%
2011-12-13 23:17:28,265 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201112071152_0071 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
Log file says:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
Am I doing something fundamentally wrong here?
Thanks,
Mario
From the docs https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF it appears that regexp_extract() is a record/line extraction of the data you wish to extract.
It seems to work on a first found (then quit) as opposed to global. Therefore the index references the capture group.
0 = the entire match
1 = capture group 1
2 = capture group 2, etc ...
Paraphrased from the manual:
regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
^ ^
groups 1 2
This returns 'bar'.
So, in your case, to get the text after the dot, something like this might work:
regexp_extract(name, '\.([^.]+)', 1)
or this
regexp_extract(name, '[.]([^.]+)', 1)
edit
I got re-interested in this, just a fyi, there could be a shortcut/workaround for you.
It looks like you want a particular segment separated with a dot .
character, which is almost like split.
Its more than likely the regex engine used overwrites a group if it is quantified more than once.
You can take advantage of that with something like this:
Returns the first segment: abc
.def.ghi
regexp_extract(name, '^(?:([^.]+)\.?){1}', 1)
Returns the second segment: abc.def
.ghi
regexp_extract(name, '^(?:([^.]+)\.?){2}', 1)
Returns the third segment: abc.def.ghi
regexp_extract(name, '^(?:([^.]+)\.?){3}', 1)
The index doesn't change (because the index still referrs to capture group 1), only the regex repetition changes.
Some notes:
This regex ^(?:([^.]+)\.?){n}
has problems though.
It requires there be something between dots in the segment or the regex won't match ...
.
It could be this ^(?:([^.]*)\.?){n}
but this will match even if there is less than n-1 dots,
including the empty string. This is probably not desireable.
There is a way to do it where it doesn't require text between the dots, but still requires at least n-1 dots.
This uses a lookahead assertion and capture buffer 2 as a flag.
^(?:(?!\2)([^.]*)(?:\.|$())){2}
, everything else is the same.
So, if it uses java style regex, then this should work.
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){2}', 1)
change {2} to whatever 'segment' is needed (this does segment 2).
and it still returns capture buffer 1 after the {N}'th iteration.
Here it is broken down
^ # Begining of string
(?: # Grouping
(?!\2) # Assertion: Capture buffer 2 is UNDEFINED
( [^.]*) # Capture buffer 1, optional non-dot chars, many times
(?: # Grouping
\. # Dot character
| # or,
$ () # End of string, set capture buffer 2 DEFINED (prevents recursion when end of string)
) # End grouping
){3} # End grouping, repeat group exactly 3 (or N) times (overwrites capture buffer 1 each time)
If it doesn't do assertions, then this won't work!
I think you have to make 'groups' no?
select distinct regexp_extract(name, '([^.]+)', 1) from dummy;
(untested)
I think it behaves like the java library and this should work, let me know though.