How to take a word and create an indicator variable

Posted 2019-07-11 02:48

I have a vector of words and a vector of comments:

word.list <- c("very", "experience", "glad")

comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

I would like to create a data frame that looks like this:

df <- data.frame(comments = c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad"),
               very = c(1,0,0,0,1),
               glad = c(0,1,0,0,1),
               experience = c(1,0,0,1,0))

I have 12,000+ comments and 20 words I would like to do this with. How do I go about doing this efficiently? For loops? Any other method?

Tags: r regex grepl
3 answers
等我变得足够好
#2 · 2019-07-11 03:11

Using base R, the code below loops over each word in word.list and each comment, checks whether the word appears among the comment's tokens (the comment split on spaces, periods and commas), and then combines the results into a data frame:

df <- as.data.frame(do.call(cbind,lapply(word.list,function(w) 
          as.numeric(sapply(comments,function(v) w %in% unlist(strsplit(v,"[ \\.,]")))))))
names(df) <- word.list
df <- cbind(comments,df)

df
                                                                        comments very experience glad
1 very good experience. first time I have been and I would definitely come back.    1          1    0
2                                               glad I scheduled an appointment.    0          0    1
3                                            the staff have become more cordial.    0          0    0
4                                      the experience i had was not good at all.    0          1    0
5                                                                 i am very glad    1          0    1
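To make the matching step concrete, here is a minimal sketch of what happens to a single comment. The variable names (`comment`, `tokens`, `has_word`, and so on) are illustrative and not part of the answer above; the last line uses a made-up comment to show one limitation of the split pattern.

```r
# Split one comment on spaces, periods and commas, as in the answer above
comment <- "very good experience. first time I have been"
tokens <- unlist(strsplit(comment, "[ \\.,]"))
print(tokens)

# %in% compares whole tokens, so only exact words count as a match
has_word <- "experience" %in% tokens
has_partial <- "exp" %in% tokens

# Punctuation outside the split pattern stays attached to the token:
# "glad!" is not the same token as "glad" (hypothetical comment)
has_glad <- "glad" %in% unlist(strsplit("so glad!", "[ \\.,]"))
```

Because the comparison is on whole tokens, this approach does not match "very" inside a longer word, but any punctuation you expect in the comments (such as "!" or "?") would need to be added to the split pattern.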
男人必须洒脱
#3 · 2019-07-11 03:12

Loop through word.list and use grepl:

sapply(word.list, function(i) as.numeric(grepl(i, comments)))

For more readable output, convert the result to a data frame:

data.frame(comments, sapply(word.list, function(i) as.numeric(grepl(i, comments))))

Note: grepl matches substrings, so "very" will also match "veryX". If that is not desired, wrap each word in word boundaries (\b) so that only whole words match.

# To avoid matching "very" with "veryX"
sapply(word.list, function(i) as.numeric(grepl(paste0("\\b", i, "\\b"), comments)))
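The difference between the two patterns is easiest to see on input that contains the target words as substrings. The sketch below uses a made-up vector, `extra`, which is not part of the original data; `loose` and `strict` are likewise illustrative names.

```r
word.list <- c("very", "experience", "glad")

# Hypothetical comments chosen so that substring matching gives false positives
extra <- c("everything was fine", "i am very glad", "gladly recommended")

# Plain substring matching: "very" is found inside "everything",
# and "glad" is found inside "gladly"
loose <- sapply(word.list, function(w) as.numeric(grepl(w, extra)))

# Whole-word matching with \\b on both sides of the word
strict <- sapply(word.list, function(w) as.numeric(grepl(paste0("\\b", w, "\\b"), extra)))

print(loose)
print(strict)
```

Both results are matrices with one column per word, so either can be passed to data.frame() together with the comments as shown earlier.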
(account suspended)
#4 · 2019-07-11 03:32

One way is to combine the stringi and qdapTools packages:

library(stringi)
library(qdapTools)

mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|')))
#  experience glad very
#1          1    0    1
#2          0    1    0
#3          0    0    0
#4          1    0    0
#5          0    1    1

You can then use cbind or data.frame to attach the comments:

cbind(comments, mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|'))))
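If installing extra packages is not an option, roughly the same extract-then-tabulate idea can be sketched in base R. This is a swapped-in alternative, not what stringi or qdapTools do internally; `matches` and `counts` are illustrative names. Like the tabulation above, it counts occurrences rather than producing a strict 0/1 flag.

```r
word.list <- c("very", "experience", "glad")
comments <- c("very good experience. first time I have been and I would definitely come back.",
              "glad I scheduled an appointment.",
              "the staff have become more cordial.",
              "the experience i had was not good at all.",
              "i am very glad")

# Extract every occurrence of any listed word from each comment
pattern <- paste(word.list, collapse = "|")
matches <- regmatches(comments, gregexpr(pattern, comments))

# Tabulate per comment; fixing the factor levels keeps a column for every
# word (in word.list order) even when a comment has no matches
counts <- t(sapply(matches, function(m) table(factor(m, levels = word.list))))

result <- data.frame(comments, counts)
print(result)
```

To turn the counts into indicators, something like `(counts > 0) + 0` would work, since any word that appears more than once in a comment would otherwise get a value above 1.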