R: using grepl to check whether multiple strings are all present

Posted 2019-04-13 08:43

grepl("instance|percentage", labelTest$Text)

returns TRUE if either instance or percentage is present.

How can I get TRUE only when both terms are present?

Tags: r grepl
2 answers
看我几分像从前
Reply #2 · 2019-04-13 09:03
Text <- c("instance", "percentage", "n", 
          "instance percentage", "percentage instance")

grepl("instance|percentage", Text)
# TRUE  TRUE FALSE  TRUE  TRUE

grepl("instance.*percentage|percentage.*instance", Text)
# FALSE FALSE FALSE TRUE  TRUE

The latter one works by looking for:

('instance')(any character sequence)('percentage')  
OR  
('percentage')(any character sequence)('instance')

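Naturally, if you need to match more than two words in any order, this gets complicated quickly, because the pattern needs one alternative for every possible ordering of the words. In that case the solution mentioned in the comments (one grepl call per word, combined with &) is easier to implement and read.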
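As a rough sketch of that comment-style approach: the Text vector is the sample data from this answer, while contains_all is a hypothetical helper name introduced here only for illustration. It assumes the words should be matched as literal substrings, hence fixed = TRUE.

```r
Text <- c("instance", "percentage", "n",
          "instance percentage", "percentage instance")

# one grepl() call per word, combined with the element-wise AND operator
both <- grepl("instance", Text) & grepl("percentage", Text)
print(both)

# hypothetical helper: the same idea for an arbitrary vector of words
contains_all <- function(words, x) {
  # each grepl() call yields one logical vector; Reduce() ANDs them together,
  # starting from all TRUE so an empty word list matches everything
  Reduce(`&`, lapply(words, grepl, x = x, fixed = TRUE), rep(TRUE, length(x)))
}

print(contains_all(c("instance", "percentage"), Text))
```

Adding a third or fourth word then only means extending the words vector, rather than writing out every ordering in one regex.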

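Another alternative that might be relevant when matching many words is positive look-ahead, which can be thought of as a 'non-consuming' match: it checks that the word occurs somewhere ahead without advancing the match position. This requires Perl-compatible regular expressions, i.e. perl=TRUE.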

# create a vector of word combinations
words <- c("instance", "percentage", "element",
           "character", "n", "o", "p")
Text2 <- combn(words, 5, function(x) paste(x, collapse=" "))

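# build the pattern with paste0(): a line break inside the string literal
# would put literal whitespace into the regex and prevent any match
longperl <- grepl(paste0("(?=.*instance)",
                         "(?=.*percentage)",
                         "(?=.*element)",
                         "(?=.*character)"), Text2, perl=TRUE)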

# this is equivalent to the solution proposed in the comments
longstrd <- grepl("instance", Text2) & 
          grepl("percentage", Text2) & 
             grepl("element", Text2) & 
           grepl("character", Text2)

# they produce identical results
all(longperl == longstrd)
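If the list of words is long or only known at run time, the look-ahead pattern can also be assembled from a vector instead of typed by hand. The following is only a sketch: words_needed, pattern, hits and by_and are names made up for this example, and it assumes the words contain no regex metacharacters (otherwise they would need escaping first).

```r
# same test data as above
words <- c("instance", "percentage", "element",
           "character", "n", "o", "p")
Text2 <- combn(words, 5, function(x) paste(x, collapse = " "))

# assumed list of words that must all be present
words_needed <- c("instance", "percentage", "element", "character")

# one zero-width look-ahead per word, concatenated without any separator
pattern <- paste0("(?=.*", words_needed, ")", collapse = "")
print(pattern)

hits <- grepl(pattern, Text2, perl = TRUE)

# cross-check against one plain grepl() per word combined with &
by_and <- Reduce(`&`, lapply(words_needed, grepl, x = Text2))
print(identical(hits, by_and))
```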
Melony?
Reply #3 · 2019-04-13 09:04

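Use intersect() and pass it the result of one grep() call per word. grep() returns the indices of the matching elements, so the intersection contains only the elements that match both patterns: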

# base R is enough here; no package is needed to subset a character vector
vector_of_text[intersect(grep("pattern1", vector_of_text),
                         grep("pattern2", vector_of_text))]
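A small worked example of the index-based approach, assuming the sample strings from the first answer as vector_of_text; the idx_* and matches names are made up for this illustration.

```r
# assumed sample data, borrowed from the first answer
vector_of_text <- c("instance", "percentage", "n",
                    "instance percentage", "percentage instance")

# grep() returns integer positions of the matching elements
idx_instance   <- grep("instance", vector_of_text)
idx_percentage <- grep("percentage", vector_of_text)

# positions present in both index sets, used to subset the original vector
matches <- vector_of_text[intersect(idx_instance, idx_percentage)]
print(matches)

# for more than two patterns, intersect() can be folded over a list of index sets
patterns <- c("instance", "percentage")
idx_all <- Reduce(intersect, lapply(patterns, grep, x = vector_of_text))
print(vector_of_text[idx_all])
```

Unlike grepl, this returns the matching strings themselves rather than a logical vector, which is convenient when only the matching elements are wanted.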
