I am trying to create a document-term matrix, and I would like the text to be converted to lowercase first.
For this I use the following R instruction:
matrix = create_matrix(tweets[,1], toLower = TRUE, language="english",
removeStopwords=FALSE, removeNumbers=TRUE,
stemWords=TRUE)
Here is the full R code:
library(RTextTools)
library(e1071)
pos_tweets = rbind(
c('j AIME la voiture', 'positive'),
c('cette machine est performante', 'positive'),
c('je me sens en bonne forme ce matin', 'positive'),
c('je suis super excitée d aller voir le spectacle de demain', 'positive'),
c('il est mon meilleur ami', 'positive')
)
neg_tweets = rbind(
c('je séteste cette voiture', 'negative'),
c('ce film est horrible', 'negative'),
c('je suis fatiguée ce matin', 'negative'),
c('je déteste ce concert', 'negative'),
c('il n est pas mon ami', 'negative')
)
test_tweets = rbind(
c('je suis heureuse ce matin', 'negative'),
c('un bon ami', 'negative'),
c('je me sens triste', 'positive'),
c('pas belle cette maison', 'negative'),
c('mauvaise chanson', 'negative')
)
tweets = rbind(pos_tweets, neg_tweets, test_tweets)
# build dtm
matrix= create_matrix(tweets[,1], toLower = TRUE, language="french",
removeStopwords=FALSE, removeNumbers=TRUE,
stemWords=TRUE)
The problem is that I still see words with capital letters in the matrix.
Can you please explain why this happens?
Thank you
As @chateaur said, create_matrix does perform the toLower step internally; it just doesn't expose the contents of the pipeline to you at arbitrary points. RTextTools + tm build in severe structural limitations on what you can do, where, when, and in what sequence in your pipeline. It's really frustrating, and worth avoiding.
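If you want to stay with RTextTools for now, one possible workaround is to lowercase the text yourself with base R's tolower() before it ever reaches create_matrix. This is only a minimal sketch: it does not diagnose what create_matrix does internally, and the two-row `tweets` matrix here is a shortened stand-in for the one in the question.

```r
# Shortened stand-in for the tweets matrix from the question
tweets <- rbind(
  c('j AIME la voiture', 'positive'),
  c('ce film est horrible', 'negative')
)

# Lowercase the text column up front, independently of create_matrix()
lowered <- tolower(tweets[, 1])
print(lowered)

# The lowercased vector can then be passed on as before, e.g.
# (assumes RTextTools is installed and loaded):
# dtm <- create_matrix(lowered, language = "french",
#                      removeStopwords = FALSE, removeNumbers = TRUE,
#                      stemWords = TRUE)
```

Doing the case folding explicitly means you can inspect `lowered` and confirm the result before any stemming or matrix construction happens.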
I recommend you write your own pipeline, and the best open-source package I found for pipelines when I was investigating this recently was quanteda.
To illustrate the point, it has an overloaded toLower() method you can use on strings, corpora, tokens - wherever you like, no restrictions, before or after stopword removal, punctuation removal and stemming. And it has tons of other useful methods for constructing your pipeline in whatever arbitrary sequence of steps you want, unlike RTextTools + tm. (You can also measure the usefulness of a package like quanteda by looking at the number/rate of active maintainers, commits, issues, fixes, releases, hits on GitHub, SO, Google, cleanness of the code and the API...).
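As a rough sketch of what such a pipeline could look like, here is an explicit tokenize, lowercase, stem, build-matrix sequence. It assumes a recent quanteda release, where the lowercasing function for tokens is named tokens_tolower() rather than toLower(); the three sample sentences are taken from the question, and the variable names are just for illustration.

```r
library(quanteda)

# A few sample sentences from the question
txt <- c("j AIME la voiture",
         "ce film est horrible",
         "je déteste ce concert")

# Each step is explicit, so the intermediate result can be inspected at any point
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)                         # case folding
toks <- tokens_wordstem(toks, language = "french")   # stemming

# Build the document-feature matrix; tolower = FALSE because
# the tokens were already lowercased above
dtm <- dfm(toks, tolower = FALSE)

feats <- featnames(dtm)
print(feats)
```

Because the lowercasing is its own step, you could just as well move it before or after stemming, or print `toks` in between to check what each stage produced.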
Using RTextTools + tm on the frontend is sometimes painful, and often limiting. I simply found too many bugs, limitations, syntax quirks and annoyances with them - it killed my productivity and constantly drove me nuts. And it wasn't too performant either. You can still use (RTextTools +) tm for constructing and manipulating the DTM (and TF/TFIDF) matrices, and e1071 for the classifier.
Also: an honorable mention to the qdap package for similarly adding useful tools at the document/discourse level.
(PS: it's truly sad that R text-processing packages are so balkanized... so many people working at cross-purposes and furiously reinventing wheels... but sometimes that happens for several reasons.)