I'm using R to do some string processing, and I would like to identify the strings that contain a word with a certain root, as long as that word is not preceded anywhere in the string by a word with another given root.
Here is a simple toy example. Say I would like to identify the strings that contain "cat" or "cats" not preceded by "dog" or "dogs" anywhere earlier in the string.
tests = c(
"dog cat",
"dogs and cats",
"dog and cat",
"dog and fluffy cats",
"cats and dogs",
"cat and dog",
"fluffy cats and fluffy dogs")
Using this pattern, I can pull the strings that do have dog before cat:
pattern = "(dog(s|).*)(cat(s|))"
grep(pattern, tests, perl = TRUE, value = TRUE)
[1] "dog cat" "dogs and cats" "dog and cat" "dog and fluffy cats"
My attempt at the inverse with a negative lookbehind fails to compile:
neg_pattern = "(?<!dog(s|).*)(cat(s|))"
grep(neg_pattern, tests, perl = TRUE, value = TRUE)
Error in grep(neg_pattern, tests, perl = TRUE, value = TRUE) : invalid regular expression
In addition: Warning message: In grep(neg_pattern, tests, perl = TRUE, value = TRUE) : PCRE pattern compilation error 'lookbehind assertion is not fixed length' at ')(cat(s|))'
I understand that the lookbehind fails because .* is not fixed length. So how can I reject strings that have "dog" before "cat", separated by any number of other words?
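One possible way around the fixed-length restriction is to anchor at the start of the string and use a negative lookahead instead of a lookbehind. The following is a minimal sketch rather than a definitive answer. The names `no_dog_before_cat`, `result` and `alt` are illustrative, and it assumes that matching on the bare roots "dog" and "cat" is acceptable (so "dogma" or "category" would also count).

```r
tests <- c(
  "dog cat",
  "dogs and cats",
  "dog and cat",
  "dog and fluffy cats",
  "cats and dogs",
  "cat and dog",
  "fluffy cats and fluffy dogs"
)

# From the start of the string, consume one character at a time, but only
# while "dog" does not begin at the current position; then require "cat".
# A string matches only if some "cat" has no "dog" anywhere before it.
no_dog_before_cat <- "^(?:(?!dog).)*cat"
result <- grep(no_dog_before_cat, tests, perl = TRUE, value = TRUE)
print(result)

# Alternative without lookarounds (assumed equivalent for this toy data):
# keep strings that contain "cat" and drop those where "dog" precedes a "cat".
alt <- tests[grepl("cat", tests) & !grepl("dog.*cat", tests)]
print(alt)
```

On the toy vector both approaches keep "cats and dogs", "cat and dog" and "fluffy cats and fluffy dogs". They can differ on other input: for a string such as "cat dog cat", the first pattern matches (the first "cat" has no "dog" before it), while the second rejects it because a "dog" precedes the later "cat".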