R Text Mining with quanteda

I have a data set (Facebook posts) (via netvizz) and I use the quanteda package in R. Here is my R code.

# Load the relevant dictionary (relevant for analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")

# Read File
# Facebooks posts could be generated by  FB Netvizz 
# https://apps.facebook.com/netvizz
# Load FB posts as .csv-file from .zip-file 
fbpost <- read.csv("D:/FB-com.csv", sep=";")

# Define the relevant column(s)
fb_test <-as.character(FB_com$comment_message) #one column with 2700 entries
# Define as corpus
fb_corp <-corpus(fb_test)
class(fb_corp)

# LIWC Application
fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
View(fb_liwc)

Everything works until:

> fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
Creating a dfm from a corpus ...
   ... indexing 2,760 documents
   ... tokenizing texts, found 77,923 total tokens
   ... cleaning the tokens, 1584 removed entirely
   ... applying a dictionary consisting of 68 key entries
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(docs = c("text1",  : 
  invalid 'dimnames' given for data frame

How would you interpret the error message? Are there any suggestions to solve the problem?

标签： r text-mining text-analysis quanteda

1条回答

神经病院院长

2楼-- · 2019-04-16 20:24

There was a bug in quanteda version 0.7.2 that caused dfm() to fail when using a dictionary when one of the documents contains no features. Your example fails because in the cleaning stage, some of the Facebook post "documents" end up having all of their features removed through the cleaning steps.

This is not only fixed in 0.8.0, but also we changed the underlying implementation of dictionaries in dfm(), resulting in a significant speed improvement. (The LIWC is still a large and complicated dictionary, and the regular expressions still mean that it is much slower to use than simply indexing tokens. We will work on optimising this further.)

devtools::install_github("kbenoit/quanteda")
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
mydfm <- dfm(inaugTexts, dictionary = liwcdict)
## Creating a dfm from a character vector ...
##    ... indexing 57 documents
##    ... lowercasing
##    ... tokenizing
##    ... shaping tokens into data.table, found 134,024 total tokens
##    ... applying a dictionary consisting of 68 key entries
##    ... summing dictionary-matched features by document
##    ... indexing 68 feature types
##    ... building sparse matrix
##    ... created a 57 x 68 sparse dfm
##    ... complete. Elapsed time: 14.005 seconds.
topfeatures(mydfm, decreasing=FALSE)
## Fillers   Nonfl   Swear      TV  Eating   Sleep   Groom   Death  Sports  Sexual 
##       0       0       0      42      47      49      53      76      81     100

It will also work if a document contains zero features after tokenization and cleaning, which is probably what is breaking the older dfm you are using with your Facebook texts.

mytexts <- inaugTexts
mytexts[3] <- ""
mydfm <- dfm(mytexts, dictionary = liwcdict, verbose = FALSE)
which(rowSums(mydfm)==0)
## 1797-Adams 
##          3

0人赞添加讨论(0) 举报

R Text Mining with quanteda

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间