I have been using the tm package to run some text analysis. My problem is with creating a list of words and their associated frequencies. I load the data and build the corpus as follows:
library(tm)
library(RWeka)

# load the data and stack every column into a single text column
txt <- read.csv("HW.csv", header = TRUE)
df <- do.call("rbind", lapply(txt, as.data.frame))
names(df) <- "text"

# build the corpus and remove stop words
myCorpus <- Corpus(VectorSource(df$text))
myStopwords <- c(stopwords("english"), "originally", "posted")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

# build the TDM with a trigram tokenizer
btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm))
I typically use the following code to generate a list of words in a given frequency range:
frq1 <- findFreqTerms(myTdm, lowfreq=50)
Is there any way to automate this so that we get a data frame with all words and their frequencies?

The other problem I face is converting the term-document matrix into a data frame. Since I am working on large samples of data, I run into memory errors. Is there a simple solution for this?
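One way to side-step the dense conversion is slam::row_sums, which totals each term directly on the sparse matrix, so as.matrix() is never needed. A minimal sketch, with a tiny two-document placeholder corpus standing in for the HW.csv data:

```r
library(tm)
library(slam)

# tiny placeholder corpus; substitute the real myCorpus/myTdm here
docs <- Corpus(VectorSource(c("apple banana apple", "banana cherry")))
tdm  <- TermDocumentMatrix(docs)

# row_sums() works on the sparse triplet representation directly,
# so the TDM is never expanded into a dense matrix
freqs   <- slam::row_sums(tdm)
freq_df <- data.frame(word = names(freqs), freq = as.numeric(freqs))
freq_df <- freq_df[order(-freq_df$freq), ]  # most frequent first
```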
Try this: looking at the source of findFreqTerms, it appears that the function slam::row_sums does the trick when called on a term-document matrix. Try, for instance: does

apply(myTdm, 1, sum)

or

rowSums(as.matrix(myTdm))

give the ngram counts you're after?

I have the following lines in R that can help create word frequencies and put them in a table. They read a file of text in .txt format and tabulate the word frequencies; I hope this can help anyone interested.
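A minimal sketch of such a script, assuming a plain-text file (the input.txt written below is only a placeholder so the example is self-contained; the cleaning steps are adjustable):

```r
# write a small placeholder file; in practice, point scan() at your own .txt
writeLines("The quick brown fox jumps over the lazy dog The fox", "input.txt")

words <- scan("input.txt", what = "character", quote = "")
words <- tolower(words)                         # normalise case
words <- gsub("[[:punct:]]", "", words)         # strip punctuation
words <- words[words != ""]                     # drop empties left by cleaning
freq  <- sort(table(words), decreasing = TRUE)  # counts, most frequent first
freq_df <- data.frame(word = names(freq), freq = as.integer(freq))
```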
That seems to work for getting simple frequencies. I've used scan() because I had a .txt file, but it should work with read.csv too.
Depending on your needs, using some tidyverse functions might be a rough solution that offers some flexibility in how you handle capitalization, punctuation, and stop words:
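A rough sketch of that approach (the texts vector and the stop-word list below are placeholders; lowercasing, punctuation stripping, and the stop-word filter can each be adjusted or dropped):

```r
library(dplyr)
library(stringr)
library(tidyr)

texts      <- c("The quick brown fox!", "A lazy fox sleeps.")  # placeholder input
stop_words <- c("the", "a", "an")                              # placeholder stop list

word_freqs <- tibble(text = texts) %>%
  mutate(word = str_split(str_to_lower(text), "\\s+")) %>%  # lowercase, then tokenise
  unnest(word) %>%                                          # one row per word
  mutate(word = str_remove_all(word, "[[:punct:]]")) %>%    # strip punctuation
  filter(word != "", !word %in% stop_words) %>%             # drop empties and stop words
  count(word, sort = TRUE)                                  # word frequencies, descending
```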