I have two data sets, Review Data & Topic Data
Dput code of my Review Data
structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved",
"Sports and physical exercise need to be given importance"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
Dput code of my Topic Data
structure(list(word = structure(2:1, .Label = c("canteen food",
"sports and physical"), class = "factor"), Topic = structure(2:1, .Label = c("Canteen",
"Sports "), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
Dput of my Desired Output, I want to look up the words which are appearing in Topic Data and map the same to the Review Data
structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved",
"Sports and physical exercise need to be given importance"), class = "factor"),
Topic = structure(2:1, .Label = c("Canteen", "Sports "), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
What you want is something like a fuzzy join. Here's a brute-force looking for strict substring (but case-insensitive):
library(dplyr)
review %>%
full_join(topic, by = character()) %>% # full cartesian expansion
group_by(word) %>%
mutate(matched = grepl(word[1], Review, ignore.case = TRUE)) %>%
ungroup() %>%
filter(matched) %>%
select(-word, -matched)
# # A tibble: 2 x 2
# Review Topic
# <fct> <fct>
# 1 Sports and physical exercise need to be given importance "Sports "
# 2 Canteen Food could be improved "Canteen"
It's a little brute-force in that it does a cartesian join of the frames before testing with grepl
, but ... you can't really avoid some parts of that.
You can also use the fuzzyjoin
package, which is meant for joins on fuzzy things (appropriately named).
fuzzyjoin::regex_left_join(review, topic, by = c(Review = "word"), ignore_case = TRUE)
# Warning: Coercing `pattern` to a plain character vector.
# Review word Topic
# 1 Sports and physical exercise need to be given importance sports and physical Sports
# 2 Canteen Food could be improved canteen food Canteen
The warning is because your columns are factor
s, not character
, it should be harmless. If you want to hide the warning, you can use suppressWarnings
(a little strong); if you want to prevent the warning, convert all applicable columns from factor
to character
(e.g., topic[] <- lapply(topic, as.character)
, same for review$Review
, though modify it if you have numeric columns).
Amateur here. I did this using base R, not dplyr, since I'm not the best at join functions.
Below, initialize your dfs. I added more examples to make sure everything was working properly. Also chose not to use factors, makes things messy for assigning strings later.
# initialize your dfs
review <- data.frame("Review" = c("Canteen Food could be improved",
"Sports and physical exercise need to be given importance",
"canteen food x2",
"this is my sports and physical",
"SPORTS AND PHYSICAL",
"meme",
"canteen and food",
"this is my meme",
"memethis"
),
stringsAsFactors = F)
topic <- data.frame("word" = c("canteen food", "sports and physical", "meme"),
"Topic" = c("Canteen", "Sports", "meme_cat"),
stringsAsFactors = F)
Then just used some nested for loops to iterate over your desired words, find matching strings, and assign the relevant topic. And initialized everything BEFORE the for loop.
# initialize new column to write into in loop
review <- cbind(review, "Topic" = rep(NA, nrow(review)))
# initialize before for loop
a <- rep(F, nrow(topic))
# loop over words in topic and find string matches in review. if so, assign review$topic = Topic
for (i in 1:nrow(topic)) {
for(j in 1:nrow(review)) {
a[j] <- grepl(topic$word[i], review$Review[j], ignore.case=T)
}
if (any(a)) {
review$Topic[a] = topic$Topic[i]
}
review
# Review Topic
#1 Canteen Food could be improved Canteen
#2 Sports and physical exercise need to be given importance Sports
#3 canteen food x2 Canteen
#4 this is my sports and physical Sports
#5 SPORTS AND PHYSICAL Sports
#6 meme meme_cat
#7 canteen and food <NA>
#8 this is my meme meme_cat
#9 memethis meme_cat