In R, is it possible to extract capture groups from a regular expression match? As far as I can tell, none of grep, grepl, regexpr, gregexpr, sub, or gsub returns the group captures.
I need to extract key-value pairs from strings that are encoded thus:
\((.*?) :: (0\.[0-9]+)\)
I can always just do multiple full-match greps, or do some outside (non-R) processing, but I was hoping I could do it all within R. Is there a function, or a package that provides one, to do this?
gsub() can do this and return only the capture group:
However, in order for this to work, you must explicitly select elements outside your capture group as mentioned in the gsub() help.
So if your text to be selected lies in the middle of some string, adding .* before and after the capture group should allow you to only return it.
gsub(".*\\((.*?) :: (0\\.[0-9]+)\\).*", "\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"
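To see why the surrounding .* matters, here is a minimal sketch that runs the same replacement with and without it. The input string with a prefix and suffix is made up for illustration, and perl = TRUE is my own choice (not part of the answer above) so that the lazy .*? follows PCRE semantics.

```r
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"
x <- "prefix (sometext :: 0.1231313213) suffix"  # hypothetical input

# Without .* around the pattern, only the matched part is replaced,
# so the text outside the parentheses survives.
partial <- gsub(pattern, "\\1 \\2", x, perl = TRUE)

# With .* on both sides, the pattern consumes the whole string,
# so only the two captured groups remain.
whole <- gsub(paste0(".*", pattern, ".*"), "\\1 \\2", x, perl = TRUE)

print(partial)  # "prefix sometext 0.1231313213 suffix"
print(whole)    # "sometext 0.1231313213"
```

Note that this only works cleanly when each string contains a single key-value pair; with several pairs per string the leading .* would swallow all but the last one.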
gsub does this, from your example:
You need to double-escape the backslashes inside the quoted pattern; then they work for the regex.
Hope this helps.
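To make the escaping point concrete, here is a minimal sketch (not the original poster's code) that shows what the regex engine actually receives and then pulls each group out with its own sub() call. The variable names are illustrative, and perl = TRUE is an assumption on my part to get PCRE handling of the lazy quantifier.

```r
# In an R string literal, "\\(" is two characters: a backslash and "(".
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# cat() prints the pattern as the regex engine sees it,
# i.e. with single backslashes.
cat(pattern, "\n")

x <- "(sometext :: 0.1231313213)"

# The same double escaping applies to backreferences in the replacement.
key   <- sub(pattern, "\\1", x, perl = TRUE)
value <- as.numeric(sub(pattern, "\\2", x, perl = TRUE))

print(key)
print(value)
```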
Solution with strcapture from the utils package:
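A minimal sketch of how strcapture could be applied to the pattern from the question, assuming R 3.4 or later (where utils::strcapture is available). The input vector and the column names key and value are hypothetical; the prototype data frame tells strcapture how many groups to expect and what type each one should be converted to.

```r
x <- c("(sometext :: 0.1231313213)", "(other :: 0.5)")  # hypothetical inputs
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# One prototype column per capture group; the column types drive the
# conversion (character for the key, numeric for the value).
proto <- data.frame(key = character(), value = numeric(),
                    stringsAsFactors = FALSE)

pairs <- strcapture(pattern, x, proto, perl = TRUE)
print(pairs)
```

Strings that do not match the pattern come back as a row of NA values, so the result always has one row per input element.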
This is how I ended up working around this problem. I used two separate regexes to match the first and second capture groups, ran two gregexpr calls, then pulled out the matched substrings:
str_match(), from the stringr package, will do this. It returns a character matrix with one column for each group in the match (and one for the whole match):
As suggested in the stringr package, this can be achieved using either str_match() or str_extract().
Adapted from the manual:
Extracting and combining our groups:
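The manual's own example is not reproduced here, so the following is only a sketch of what this step might look like with the pattern from the question. It assumes the stringr package is installed; the input vector is made up. str_extract() returns the whole match, that is, the groups together with the literal text between them, and NA where nothing matches.

```r
library(stringr)  # assumes stringr is installed

x <- c("key (sometext :: 0.1231313213) tail", "no pair here")  # hypothetical
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# One element per input string: the full match, or NA if there is none.
combined <- str_extract(x, pattern)
print(combined)
```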
Indicating groups with an output matrix (we're interested in columns 2+):
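Again as a sketch rather than the manual's code, and assuming stringr is installed: str_match() returns a character matrix whose first column is the whole match, so the captured groups are in columns 2 onwards. The inputs and the names keys and values are illustrative.

```r
library(stringr)  # assumes stringr is installed

x <- c("(sometext :: 0.1231313213)", "(other :: 0.5)")  # hypothetical inputs
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

m <- str_match(x, pattern)

# Column 1 is the full match; drop it to keep only the capture groups.
groups <- m[, -1, drop = FALSE]

keys   <- as.vector(groups[, 1])
values <- as.numeric(groups[, 2])

print(m)
print(keys)
print(values)
```

Keeping drop = FALSE means groups stays a matrix even when x has a single element, which makes the column indexing safe for inputs of any length.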