Filter where there are at least two pattern matches

I have a lot of text data in a data.table. I have several text patterns that I'm interested in. I want to subset the table so it shows text that matches at least two of the patterns.

This is further complicated by the fact that some of the patterns already are an either/or, for example something like "paul|john".

I think I want either an expression that subsets on that basis directly, or a way to count how many of the patterns match each row, which I could then use to subset. I've seen ways to count pattern occurrences, but not ones where the counts stay clearly linked to the IDs in the original dataset, if that makes sense.

At the moment the best I can think of would be to add a column to the data.table for each pattern, check if each pattern matches individually, then filter on the sum of the patterns. This seems quite convoluted so I am hoping there is a better way, as there are quite a lot of patterns to check!

Example data

text_table <- data.table(ID = (1:5), text = c("lucy, sarah and paul live on the same street",
                                              "lucy has only moved here recently",
                                              "lucy and sarah are cousins",
                                              "john is also new to the area",
                                              "paul and john have known each other a long time"))
text_patterns <- as.character(c("lucy", "sarah", "paul|john"))

With the example data, I would want IDs 1 and 3 in the subsetted data.

Thanks for your help!

Tags: r data.table subset

1 Answer

地球回转人心会变

#2 · 2020-07-24 21:29

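We can paste the elements of 'text_patterns' together with |, use the result as the pattern in 'str_count' to count the matching substrings in each row, and keep the rows of the data.table where that count is greater than 1. Note that this counts every match of the combined pattern, so ID 5 is also returned, because "paul" and "john" are counted as two separate matches.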

library(data.table)
library(stringr)  # provides str_count() and str_detect()
text_table[str_count(text, paste(text_patterns, collapse = "|")) > 1]
#    ID                                            text
#1:  1    lucy, sarah and paul live on the same street
#2:  3                      lucy and sarah are cousins
#3:  5 paul and john have known each other a long time
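To see why ID 5 appears in the output above, it can help to look at the per-row counts for the combined pattern. The following is a minimal sketch that swaps in base R (gregexpr with regmatches) for str_count so that it runs without extra packages; the names combined and total_hits are made up for illustration.

```r
# Example text and patterns, copied from the question
text <- c("lucy, sarah and paul live on the same street",
          "lucy has only moved here recently",
          "lucy and sarah are cousins",
          "john is also new to the area",
          "paul and john have known each other a long time")
text_patterns <- c("lucy", "sarah", "paul|john")

# Collapse all patterns into one alternation: "lucy|sarah|paul|john"
combined <- paste(text_patterns, collapse = "|")

# Count every match of the combined pattern in each string
total_hits <- lengths(regmatches(text, gregexpr(combined, text)))
total_hits
# [1] 3 1 2 1 2
```

The last string scores 2 because "paul" and "john" are each counted, even though both belong to the single pattern "paul|john".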

Update

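If each element of 'text_patterns' should count at most once per row (so that "paul|john" is treated as a single pattern), we loop through the patterns, check whether each one is present (str_detect), add the resulting logical vectors together with + to get the number of patterns matched, and compare that with 1 to create the logical vector for subsetting rows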

i1 <- text_table[, Reduce(`+`, lapply(text_patterns, 
       function(x) str_detect(text, x))) >1]
text_table[i1]
#    ID                                         text
#1:  1 lucy, sarah and paul live on the same street
#2:  3                   lucy and sarah are cousins
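The question also asks for a count that stays linked to the IDs. One possible way to get that, sketched here with base R's grepl in place of str_detect, is to store the number of matched patterns as a column and filter on it. The column name n_matched and the variable result are assumptions chosen for this example.

```r
library(data.table)

# Example data and patterns, copied from the question
text_table <- data.table(
  ID = 1:5,
  text = c("lucy, sarah and paul live on the same street",
           "lucy has only moved here recently",
           "lucy and sarah are cousins",
           "john is also new to the area",
           "paul and john have known each other a long time"))
text_patterns <- c("lucy", "sarah", "paul|john")

# For each pattern, grepl() gives one TRUE/FALSE per row;
# adding the logical vectors yields the number of patterns matched per row
text_table[, n_matched := Reduce(`+`, lapply(text_patterns, grepl, x = text))]

# Keep rows where at least two distinct patterns matched
result <- text_table[n_matched >= 2]
result$ID
# [1] 1 3
```

Keeping the count as a column means the threshold can be changed later without re-running the pattern matching.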
