I would like to preprocess a corpus of documents using Python in the same way that I can in R. For example, given an initial corpus named corpus, I would like to end up with a preprocessed corpus that corresponds to the one produced using the following R code:
library(tm)
library(SnowballC)
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("myword", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
Is there a simple or straightforward — preferably pre-built — method of doing this in Python? Is there a way to ensure exactly the same results?
For example, I would like to preprocess
@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!
into
ear pod amaz best sound inear headphon ive ever
CountVectorizer and TfidfVectorizer can be customized as described in the docs. In particular, you'll want to write a custom tokenizer, which is a function that takes a document and returns a list of terms. Using NLTK:
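A minimal sketch of such a tokenizer plus a short demo, assuming the NLTK stopword data has been downloaded; the Porter stemmer, and using "apple" in place of the R code's "myword", are stand-ins for whatever your pipeline actually needs, and this won't necessarily reproduce tm's output character for character:

import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
# "apple" plays the role of "myword" in the R code
stop = set(stopwords.words('english')) | {'apple'}

def tokenize(document):
    # mirror the R pipeline: lowercase, strip punctuation, drop stop words, stem
    text = document.lower().translate(str.maketrans('', '', string.punctuation))
    return [stemmer.stem(t) for t in text.split() if t not in stop]

vectorizer = CountVectorizer(tokenizer=tokenize, lowercase=False)

Demo:

doc = "@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!"
print(tokenize(doc))
# should print something very close to the target:
# ['ear', 'pod', 'amaz', 'best', 'sound', 'inear', 'headphon', 'ive', 'ever']
print(vectorizer.fit_transform([doc]).shape)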
(The example I linked to actually uses a class to cache the lemmatizer, but a function works too.)
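That class-based version from the scikit-learn docs looks roughly like this (it lemmatizes rather than stems, and assumes the NLTK punkt and WordNet data have been downloaded):

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer(object):
    # the class caches the lemmatizer so it is built once, not once per document
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

vect = CountVectorizer(tokenizer=LemmaTokenizer())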
It seems tricky to get things exactly the same between nltk and tm on the preprocessing steps, so I think the best approach is to use rpy2 to run the preprocessing in R and pull the results into Python:
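A sketch of that approach, assuming rpy2 plus the R packages tm and SnowballC are installed; "apple" again stands in for "myword", the way the text is pulled back out (sapply(..., as.character)) may need adjusting for your tm version, and recent tm releases want content_transformer(tolower) rather than bare tolower:

import rpy2.robjects as robjects

# run the same tm pipeline as in the question inside embedded R
robjects.r('''
    library(tm)
    library(SnowballC)
    docs <- c("@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!")
    corpus <- Corpus(VectorSource(docs))
    corpus <- tm_map(corpus, tolower)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, c("apple", stopwords("english")))
    corpus <- tm_map(corpus, stemDocument)
    preprocessed <- sapply(corpus, as.character)
''')

# pull the preprocessed documents back into Python as plain strings
preprocessed_docs = list(robjects.r('preprocessed'))
print(preprocessed_docs)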
Then, you can load it into scikit-learn -- the only thing you'll need to do to get things to match between the CountVectorizer and the DocumentTermMatrix is to remove terms of length less than 3. Let's verify this matches with R:
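A sketch of that last step, assuming preprocessed_docs and the R-side corpus from the rpy2 snippet above are still in scope; restricting the token pattern to words of three or more characters is what mimics DocumentTermMatrix's default wordLengths = c(3, Inf):

import rpy2.robjects as robjects
from sklearn.feature_extraction.text import CountVectorizer

# only count terms of length >= 3, matching tm's default wordLengths = c(3, Inf)
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w{3,}\b')
dtm_py = vectorizer.fit_transform(preprocessed_docs)
print(dtm_py.shape, dtm_py.nnz, len(vectorizer.vocabulary_))

# build the DocumentTermMatrix on the R side for comparison
robjects.r('dtm_r <- DocumentTermMatrix(corpus)')
print(robjects.r('dim(dtm_r)'))       # documents x terms
print(robjects.r('length(dtm_r$v)'))  # number of stored (non-zero) elements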
As you can see, the number of stored elements and terms exactly match between the two approaches now.