I want to analyze a big (n=500,000) corpus of documents. I am using quanteda
in the expectation that will be faster than tm_map()
from tm
. I want to proceed step by step instead of using the automated way with dfm()
. I have reasons for this: in one case, I don't want to tokenize before removing stopwords as this would result in many useless bigrams, in another I have to preprocess the text with language-specific procedures.
I would like this sequence to be implemented:
1) remove the punctuation and numbers
2) remove stopwords (i.e. before the tokenization to avoid useless tokens)
3) tokenize using unigrams and bigrams
4) create the dfm
My attempt:
> library(quanteda)
> packageVersion("quanteda")
[1] ‘0.9.8’
> text <- ie2010Corpus$documents$texts
> text.corpus <- quanteda:::corpus(text, docnames=rownames(ie2010Corpus$documents))
> class(text.corpus)
[1] "corpus" "list"
> stopw <- c("a","the", "all", "some")
> TextNoStop <- removeFeatures(text.corpus, features = stopw)
# Error in UseMethod("selectFeatures") :
# no applicable method for 'selectFeatures' applied to an object of class "c('corpus', 'list')"
# This is how I would theoretically continue:
> token <- tokenize(TextNoStop, removePunct=TRUE, removeNumbers=TRUE)
> token2 <- ngrams(token,c(1,2))
Bonus question
How do I remove sparse tokens in quanteda
? (i.e. equivalent of removeSparseTerms()
in tm
.
UPDATE
At the light of @Ken's answer, here is the code to proceed step by step with quanteda
:
library(quanteda)
packageVersion("quanteda")
[1] ‘0.9.8’
1) Remove custom punctuation and numbers. E.g. notice that the "\n" in the ie2010 corpus
text.corpus <- ie2010Corpus
texts(text.corpus)[1] # Use texts() to extrapolate text
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery.\nIt is
texts(text.corpus)[1] <- gsub("\\s"," ",text.corpus[1]) # remove all spaces (incl \n, \t, \r...)
texts(text.corpus)[1]
2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery. It is of e
A further note on the reason why one may prefer to preprocess. My present corpus is in Italian, a language that has articles connected to the words with an apostrophe. Thus, the straight dfm()
can lead to inexact tokenization.
e.g.:
broken.tokens <- dfm(corpus(c("L'abile presidente Renzi. Un'abile mossa di Berlusconi"), removePunct=TRUE))
will produce two separated tokens for the same word ("un'abile" and "l'abile"), hence the need of an additional step with gsub()
here.
2) In quanteda
it is not possible to remove stopwords directly in the text before the tokenization. In my previous example "l" and "un" have to be removed not to produce misleading bigrams. This can be handled in tm
with tm_map(..., removeWords)
.
3) Tokenization
token <- tokenize(text.corpus[1], removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)
4) Create the dfm:
dfm <- dfm(token)
5) Remove sparse features
dfm <- trim(dfm, minCount = 5)