Negative lookbehind in R with multi-word separation

Posted 2019-08-26 00:49

Question:

I'm using R to do some string processing, and would like to identify the strings that contain a word with a certain root, but only when no word with another given root appears earlier in the string.

Here is a simple toy example. Say I would like to identify the strings that have the word "cat/s" not preceded by "dog/s" anywhere in the string.

 tests = c(
   "dog cat",
   "dogs and cats",
   "dog and cat", 
   "dog and fluffy cats",
   "cats and dogs", 
   "cat and dog",  
   "fluffy cats and fluffy dogs")  

Using this pattern, I can pull the strings that do have dog before cat:

 pattern = "(dog(s|).*)(cat(s|))"
 grep(pattern, tests, perl = TRUE, value = TRUE)

[1] "dog cat"  "dogs and cats"   "dog and cat"   "dog and fluffy cats"

My negative lookbehind is having problems:

 neg_pattern = "(?<!dog(s|).*)(cat(s|))"
 grep(neg_pattern, tests, perl = TRUE, value = TRUE)

Error in grep(neg_pattern, tests, perl = TRUE, value = TRUE) : invalid regular expression

In addition: Warning message: In grep(neg_pattern, tests, perl = TRUE, value = TRUE) : PCRE pattern compilation error 'lookbehind assertion is not fixed length' at ')(cat(s|))'

I understand that .* is not fixed length, so how can I reject strings that have "dog" before "cat" separated by any number of other words?

Answer 1:

I hope that this can help:

tests = c(
  "dog cat",
  "dogs and cats",
  "dog and cat", 
  "dog and fluffy cats",
  "cats and dogs", 
  "cat and dog",  
  "fluffy cats and fluffy dogs"
)

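# remove strings that have cats after dogs
# (logical indexing with !grepl() is safe when nothing matches;
#  tests[-grep(...)] would return an empty vector in that case)
tests = tests[!grepl(pattern = "dog(?:s|).*cat(?:s|)", x = tests)]

# select only strings that contain cats
tests = tests[grepl(pattern = "cat(?:s|)", x = tests)]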

tests

[1] "cats and dogs"               "cat and dog"                
[3] "fluffy cats and fluffy dogs"

I'm not sure if you wanted to do this with one expression, but Regex can still be very useful when applied iteratively.
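If a single expression is preferred, one possible sketch (not part of the answer above) is to replace the lookbehind with a negative lookahead anchored at the start of the string. PCRE requires lookbehinds to be fixed length, but lookaheads may contain `.*`, so the check "no dog followed later by cat" can be made once from position 0. The names `single_pattern` and `result` are introduced here for illustration only.

```r
tests <- c(
  "dog cat",
  "dogs and cats",
  "dog and cat",
  "dog and fluffy cats",
  "cats and dogs",
  "cat and dog",
  "fluffy cats and fluffy dogs"
)

# ^              anchor at the start of the string
# (?!.*dog.*cat) fail if "dog" appears anywhere and "cat" appears after it
# .*cat          otherwise require "cat" somewhere in the string
single_pattern <- "^(?!.*dog.*cat).*cat"

# perl = TRUE is needed because lookaheads are a PCRE feature
result <- grep(single_pattern, tests, perl = TRUE, value = TRUE)
print(result)
```

With the toy data this should keep the same three strings as the two-step approach. Because "dog" and "cat" are matched as substrings, the plural forms are covered without an optional `s`; the same substring matching also means a word such as "category" would count as a hit, so adding `\\b` word boundaries may be worth considering for real data.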