I'm trying to create a subset of a data frame of news articles that mention at least one element of a set of keywords or phrases.
# Sample data frame of articles
articles <- data.frame(
  id = c(1, 2, 3, 4),
  text = c(
    "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,",
    "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo",
    "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"
  )
)
articles$text <- as.character(articles$text)
# Sample vector of keywords or phrases
keywords <- as.character(c("elit", "tempor incididunt", "reprehenderit"))
# id text
# 1 1 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# 2 2 tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
# 3 3 quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
# 4 4 consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
Given the vector of keywords, the subset should contain rows 1, 2, and 4, since those rows contain one or more of the elements of the vector.
Neither %in% nor grep() works: %in% seems to compare each whole string against the keywords rather than searching within it (articles$text %in% keywords results in four FALSEs), and grep() doesn't seem to be able to handle a vector of patterns (grep(keywords, articles$text) gives an error). Neither function alone seems to work across both dimensions: it would be easy to search all the rows for one keyword, but not for all three at the same time.
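To make the two failed attempts concrete, here is a minimal reproduction. The names in_result and grep_result are only illustrative, and the tryCatch() wrapper is an assumption on my part: depending on the R version, a multi-element pattern may be reported as a warning rather than an error, so the snippet just records which one occurs.

```r
# Same sample data as above, repeated so this snippet runs on its own
articles <- data.frame(
  id = c(1, 2, 3, 4),
  text = c(
    "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,",
    "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo",
    "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"
  ),
  stringsAsFactors = FALSE
)
keywords <- c("elit", "tempor incididunt", "reprehenderit")

# %in% tests whole-string equality, and no article text equals a keyword,
# so every element of the result is FALSE
in_result <- articles$text %in% keywords

# grep() expects a single pattern; record whether passing three patterns
# is signalled as a warning or as an error in this R version
grep_result <- tryCatch(
  grep(keywords, articles$text),
  warning = function(w) "warning",
  error = function(e) "error"
)
```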
What's the best way to find and select all rows of the data frame that contain at least one of the elements of the keyword vector?
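One direction that might work, offered only as a sketch and not necessarily the best way: collapse the keywords into a single regular expression with paste(collapse = "|") and pass that to grepl(), which returns one TRUE/FALSE per row. This assumes the keywords contain no regex metacharacters (they would need escaping otherwise), and the names pattern, matches and matched_articles are made up for illustration.

```r
# Same sample data as in the question, repeated so this snippet runs on its own
articles <- data.frame(
  id = c(1, 2, 3, 4),
  text = c(
    "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,",
    "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo",
    "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"
  ),
  stringsAsFactors = FALSE
)
keywords <- c("elit", "tempor incididunt", "reprehenderit")

# Join the keywords into one alternation pattern:
# "elit|tempor incididunt|reprehenderit"
# (assumes no keyword contains regex metacharacters)
pattern <- paste(keywords, collapse = "|")

# grepl() returns a logical vector with one entry per article
matches <- grepl(pattern, articles$text)

# Keep only the rows whose text matched at least one keyword
matched_articles <- articles[matches, ]
```

With the sample data this keeps the rows with id 1, 2 and 4. Note that the match is on substrings, so "elit" also matches inside "velit" in row 4.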