Mapping the topic of the review in R

2020-08-01 06:09发布

I have two data sets, Review Data & Topic Data

Dput code of my Review Data

structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", 
"Sports and physical exercise need to be given importance"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Dput code of my Topic Data

structure(list(word = structure(2:1, .Label = c("canteen food", 
"sports and physical"), class = "factor"), Topic = structure(2:1, .Label = c("Canteen", 
"Sports "), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Dput of my Desired Output, I want to look up the words which are appearing in Topic Data and map the same to the Review Data

structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", 
"Sports and physical exercise need to be given importance"), class = "factor"), 
    Topic = structure(2:1, .Label = c("Canteen", "Sports "), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

2条回答
2楼-- · 2020-08-01 06:12

Amateur here. I did this using base R, not dplyr, since I'm not the best at join functions.

Below, initialize your dfs. I added more examples to make sure everything was working properly. Also chose not to use factors, makes things messy for assigning strings later.

# initialize your dfs
review <- data.frame("Review" = c("Canteen Food could be improved", 
                                  "Sports and physical exercise need to be given importance",
                                  "canteen food x2",
                                  "this is my sports and physical",
                                  "SPORTS AND PHYSICAL",
                                  "meme",
                                  "canteen and food",
                                  "this is my meme",
                                  "memethis"
                                  ),
                     stringsAsFactors = F)

topic <- data.frame("word" = c("canteen food", "sports and physical", "meme"), 
                    "Topic" = c("Canteen", "Sports", "meme_cat"),
                    stringsAsFactors = F)

Then just used some nested for loops to iterate over your desired words, find matching strings, and assign the relevant topic. And initialized everything BEFORE the for loop.

# initialize new column to write into in loop
review <- cbind(review, "Topic" = rep(NA, nrow(review)))

# initialize before for loop
a <- rep(F, nrow(topic))

# loop over words in topic and find string matches in review. if so, assign review$topic = Topic
for (i in 1:nrow(topic)) {
  for(j in 1:nrow(review)) {
    a[j] <- grepl(topic$word[i], review$Review[j], ignore.case=T)
  }
  if (any(a)) {
    review$Topic[a] = topic$Topic[i]
  }

review
#                                                    Review    Topic
#1                           Canteen Food could be improved  Canteen
#2 Sports and physical exercise need to be given importance   Sports
#3                                          canteen food x2  Canteen
#4                           this is my sports and physical   Sports
#5                                      SPORTS AND PHYSICAL   Sports
#6                                                     meme meme_cat
#7                                         canteen and food     <NA>
#8                                          this is my meme meme_cat
#9                                                 memethis meme_cat
查看更多
Explosion°爆炸
3楼-- · 2020-08-01 06:32

What you want is something like a fuzzy join. Here's a brute-force looking for strict substring (but case-insensitive):

library(dplyr)
review %>%
  full_join(topic, by = character()) %>% # full cartesian expansion
  group_by(word) %>%
  mutate(matched = grepl(word[1], Review, ignore.case = TRUE)) %>%
  ungroup() %>%
  filter(matched) %>%
  select(-word, -matched)
# # A tibble: 2 x 2
#   Review                                                   Topic    
#   <fct>                                                    <fct>    
# 1 Sports and physical exercise need to be given importance "Sports "
# 2 Canteen Food could be improved                           "Canteen"

It's a little brute-force in that it does a cartesian join of the frames before testing with grepl, but ... you can't really avoid some parts of that.

You can also use the fuzzyjoin package, which is meant for joins on fuzzy things (appropriately named).

fuzzyjoin::regex_left_join(review, topic, by = c(Review = "word"), ignore_case = TRUE)
# Warning: Coercing `pattern` to a plain character vector.
#                                                     Review                word   Topic
# 1 Sports and physical exercise need to be given importance sports and physical Sports 
# 2                           Canteen Food could be improved        canteen food Canteen

The warning is because your columns are factors, not character, it should be harmless. If you want to hide the warning, you can use suppressWarnings (a little strong); if you want to prevent the warning, convert all applicable columns from factor to character (e.g., topic[] <- lapply(topic, as.character), same for review$Review, though modify it if you have numeric columns).

查看更多
登录 后发表回答