I have a lot of text data in a data.table. I have several text patterns that I'm interested in. I want to subset the table so it shows text that matches at least two of the patterns.
This is further complicated by the fact that some of the patterns are already an either/or, for example "paul|john".
I think I either want an expression that subsets directly on that basis, or alternatively, if I could count how many of the patterns match, I could use that count to subset. I've seen ways to count the number of times patterns occur, but not where the result is clearly linked to the IDs in the original dataset, if that makes sense.
At the moment the best I can think of would be to add a column to the data.table for each pattern, check if each pattern matches individually, then filter on the sum of the patterns. This seems quite convoluted so I am hoping there is a better way, as there are quite a lot of patterns to check!
Example data
library(data.table)
text_table <- data.table(ID = (1:5), text = c("lucy, sarah and paul live on the same street",
"lucy has only moved here recently",
"lucy and sarah are cousins",
"john is also new to the area",
"paul and john have known each other a long time"))
text_patterns <- as.character(c("lucy", "sarah", "paul|john"))
With the example data, I would want IDs 1 and 3 in the subsetted data.
Thanks for your help!
We can paste the 'text_patterns' together with | as the separator, use that as the pattern in 'str_count' to get the count of matching substrings, and check whether it is greater than 1 to filter the rows of the data.table.
Update
If we need to consider each element of 'text_patterns' as its own pattern, we loop through the patterns, check whether each pattern is present (str_detect), and add up the results for all the patterns with +. Checking whether that sum is greater than 1 creates the logical vector for subsetting rows.