I have two data sets, Review Data and Topic Data.
dput() output of my Review Data:
structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved",
"Sports and physical exercise need to be given importance"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
dput() output of my Topic Data:
structure(list(word = structure(2:1, .Label = c("canteen food",
"sports and physical"), class = "factor"), Topic = structure(2:1, .Label = c("Canteen",
"Sports "), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
I want to look up the words that appear in Topic Data and map the corresponding topic onto the Review Data. dput() output of my desired result:
structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved",
"Sports and physical exercise need to be given importance"), class = "factor"),
Topic = structure(2:1, .Label = c("Canteen", "Sports "), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
Amateur here. I did this using base R, not dplyr, since I'm not the best at join functions.
First, initialize your data frames. I added more examples to make sure everything was working properly. I also chose not to use factors, since they make assigning strings messy later.
Then I used nested for loops to iterate over your desired words, find matching strings, and assign the relevant topic. Everything is initialized before the loops.
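The loop approach described above could look roughly like the sketch below. The third review is a hypothetical extra row added only to show the no-match case, the trailing space in the question's "Sports " label is dropped, and keeping only the first matching topic is an assumption, not something the question specifies.

```r
# Data from the question as character columns (no factors).
review <- data.frame(
  Review = c("Sports and physical exercise need to be given importance",
             "Canteen Food could be improved",
             "Library hours are too short"),  # hypothetical row with no topic
  stringsAsFactors = FALSE
)
topic <- data.frame(
  word  = c("sports and physical", "canteen food"),
  Topic = c("Sports", "Canteen"),
  stringsAsFactors = FALSE
)

# Initialize the result column before the loops.
review$Topic <- NA_character_

for (i in seq_len(nrow(review))) {
  for (j in seq_len(nrow(topic))) {
    # Case-insensitive literal substring test.
    if (grepl(tolower(topic$word[j]), tolower(review$Review[i]), fixed = TRUE)) {
      review$Topic[i] <- topic$Topic[j]
      break  # assumption: keep the first matching topic only
    }
  }
}

print(review)
```

Reviews that match no word keep NA in the Topic column.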
What you want is something like a fuzzy join. Here's a brute-force approach that looks for a strict (but case-insensitive) substring match:
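A minimal base R sketch of that idea, assuming character columns and the question's data (with the trailing space in "Sports " dropped): cross-join the two frames, test each pair with grepl, then join the hits back onto the reviews. The helper names pairs, hit, matched, and out are illustrative only.

```r
review <- data.frame(
  Review = c("Sports and physical exercise need to be given importance",
             "Canteen Food could be improved"),
  stringsAsFactors = FALSE
)
topic <- data.frame(
  word  = c("sports and physical", "canteen food"),
  Topic = c("Sports", "Canteen"),
  stringsAsFactors = FALSE
)

# Cartesian join: by = NULL pairs every review with every topic row.
pairs <- merge(review, topic, by = NULL)

# Test each pair: is the lower-cased word a literal substring of the review?
hit <- mapply(grepl, tolower(pairs$word), tolower(pairs$Review),
              MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
matched <- pairs[hit, c("Review", "Topic")]

# Left-join the hits back so unmatched reviews would be kept with Topic = NA.
out <- merge(review, matched, by = "Review", all.x = TRUE)
print(out)
```

The cross join grows as nrow(review) * nrow(topic), so this is only practical for modestly sized inputs.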
It's a little brute-force in that it does a cartesian join of the frames before testing with grepl, but ... you can't really avoid some parts of that.
You can also use the fuzzyjoin package, which is meant for joins on fuzzy things (appropriately named).
The warning is because your columns are factors, not character, it should be harmless. If you want to hide the warning, you can use suppressWarnings (a little strong); if you want to prevent the warning, convert all applicable columns from factor to character (e.g., topic[] <- lapply(topic, as.character), same for review$Review, though modify it if you have numeric columns).
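As a sketch of the conversion and the fuzzyjoin route together: the frames below start with factor columns like the question's dput output, are converted to character, and are then joined with fuzzyjoin::regex_left_join if that package is installed. Treating each word as a regular expression (rather than a literal substring) is a property of this particular function, and the choice of regex_left_join is an assumption about which fuzzyjoin call fits best here.

```r
# Start from factor columns, as in the question's dput output.
review <- data.frame(
  Review = c("Sports and physical exercise need to be given importance",
             "Canteen Food could be improved"),
  stringsAsFactors = TRUE
)
topic <- data.frame(
  word  = c("sports and physical", "canteen food"),
  Topic = c("Sports", "Canteen"),
  stringsAsFactors = TRUE
)

# Convert factors to character up front (adapt this if some columns are numeric).
topic[] <- lapply(topic, as.character)
review$Review <- as.character(review$Review)

# Optional: only runs when the fuzzyjoin package is available.
if (requireNamespace("fuzzyjoin", quietly = TRUE)) {
  # Each value of topic$word is used as a regex matched against review$Review.
  joined <- fuzzyjoin::regex_left_join(review, topic,
                                       by = c(Review = "word"),
                                       ignore_case = TRUE)
  print(joined[, c("Review", "Topic")])
}
```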