How to take a word and create an indicator variable

Posted 2019-07-11 03:24

Question:

I have a vector of words and a vector of comments:

word.list <- c("very", "experience", "glad")

comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

I would like to create a data frame that looks like this:

df <- data.frame(comments = c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad"),
               very = c(1,0,0,0,1),
               glad = c(0,1,0,0,1),
               experience = c(1,0,0,1,0))

I have 12,000+ comments and 20 words I would like to do this with. How do I go about doing this efficiently? For loops? Any other method?

Answer 1:

Loop through word.list and use grepl:

sapply(word.list, function(i) as.numeric(grepl(i, comments)))

For tidier output, bind the result to the comments in a data frame:

data.frame(comments, sapply(word.list, function(i) as.numeric(grepl(i, comments))))

Note: grepl matches substrings, so "very" also matches "veryX". If that is not desired, wrap each word in word boundaries (\\b) so that only whole words match:

# To avoid matching "very" with "veryX"
sapply(word.list, function(i) as.numeric(grepl(paste0("\\b", i, "\\b"), comments)))
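
To see the difference between the two patterns, here is a minimal sketch. The test strings are made up for illustration (they are not from the question), and vapply is used instead of sapply only to pin down the return type; that is a stylistic choice, not something the answer above requires.

```r
word.list <- c("very", "experience", "glad")

# Hypothetical test strings, invented for this illustration
test.comments <- c("veryX good", "very good", "gladly done")

# Substring match: "very" is found inside "veryX", "glad" inside "gladly"
substring.match <- vapply(
  word.list,
  function(w) as.numeric(grepl(w, test.comments)),
  numeric(length(test.comments))
)

# Whole-word match: \\b requires a word boundary on both sides
word.match <- vapply(
  word.list,
  function(w) as.numeric(grepl(paste0("\\b", w, "\\b"), test.comments)),
  numeric(length(test.comments))
)

substring.match
word.match
```

Both results are numeric matrices with one column per word, so either can be passed to data.frame together with the comments as shown above.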


Answer 2:

One way is to combine the stringi and qdapTools packages:

library(stringi)
library(qdapTools)

mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|')))
#  experience glad very
#1          1    0    1
#2          0    1    0
#3          0    0    0
#4          1    0    0
#5          0    1    1

You can then use cbind or data.frame to bind the result to the comments:

cbind(comments, mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|'))))
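
If installing the two packages is not an option, the same extract-then-tabulate idea can be sketched in base R. This is only an analogue under my own assumptions, not how stringi or qdapTools work internally; the names hits, counts and indicators, and the sixth comment, are made up for illustration. Tabulating matches gives counts, so a word that appears twice in a comment yields 2; the last step clamps the counts to 0/1 indicators.

```r
word.list <- c("very", "experience", "glad")
comments <- c("very good experience. first time I have been and I would definitely come back.",
              "glad I scheduled an appointment.",
              "the staff have become more cordial.",
              "the experience i had was not good at all.",
              "i am very glad",
              "very, very glad")  # hypothetical extra comment with a repeated word

# Extract every occurrence of any word in word.list from each comment
pattern <- paste(word.list, collapse = "|")
hits <- regmatches(comments, gregexpr(pattern, comments))

# Tabulate the matches per comment; fixing the factor levels keeps zero counts
counts <- t(vapply(
  hits,
  function(h) as.numeric(table(factor(h, levels = word.list))),
  numeric(length(word.list))
))
colnames(counts) <- word.list

# Clamp counts to 0/1 so the columns are indicators rather than frequencies
indicators <- (counts > 0) * 1

data.frame(comments, indicators)
```

Like the pattern above, this matches substrings; wrapping each word in \\b before pasting would restrict it to whole words.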


Answer 3:

Using base R, this code loops over the word list and, for each comment, checks whether the word is among the comment's tokens (each comment is split on spaces, periods, and commas). The results are then combined into a data frame:

df <- as.data.frame(do.call(cbind,lapply(word.list,function(w) 
          as.numeric(sapply(comments,function(v) w %in% unlist(strsplit(v,"[ \\.,]")))))))
names(df) <- word.list
df <- cbind(comments,df)

df
                                                                        comments very experience glad
1 very good experience. first time I have been and I would definitely come back.    1          1    0
2                                               glad I scheduled an appointment.    0          0    1
3                                            the staff have become more cordial.    0          0    0
4                                      the experience i had was not good at all.    0          1    0
5                                                                 i am very glad    1          0    1
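
A possible variation on the same idea, offered as a sketch rather than a tested benchmark: split each comment once up front instead of once per word, lower-case the text first so that "Very" and "very" are treated alike, and split on any run of non-letters rather than only on spaces, periods, and commas. The names tokens and indicators are mine, and the lower-casing is an assumption about what is wanted.

```r
word.list <- c("very", "experience", "glad")
comments <- c("very good experience. first time I have been and I would definitely come back.",
              "glad I scheduled an appointment.",
              "the staff have become more cordial.",
              "the experience i had was not good at all.",
              "i am very glad")

# Tokenise each comment once: lower-case, then split on runs of non-letters
tokens <- strsplit(tolower(comments), "[^[:alpha:]]+")

# One column per word: 1 if the word is among that comment's tokens, else 0
indicators <- vapply(
  word.list,
  function(w) vapply(tokens, function(tok) as.numeric(w %in% tok), numeric(1)),
  numeric(length(comments))
)

df <- data.frame(comments, indicators, stringsAsFactors = FALSE)
df
```

With 12,000+ comments and 20 words this calls strsplit 12,000 times rather than 240,000, which should help, though I have not timed it on data of that size.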


Tags: r regex grepl