I have a (small) problem with the tm library for R. Say I have a corpus:
# boilerplate
library(tm)
bcorp <- c("one","two","three","four","five")
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)
Result:
[1] "1" "2" "3" "4" "5"
This works. But when I try to apply a transformation with tm_map():
# this does not work
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
tdm <- TermDocumentMatrix(myCorpus)
This gives:
Error: inherits(doc, "TextDocument") is not TRUE
The solution proposed for this error was to convert the documents to PlainTextDocument:
# this works but erases the metadata
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)
Result:
[1] "character(0)" "character(0)" "character(0)" "character(0)" "character(0)"
Now it works, but it erases all the metadata (in this case the document names). Is there a way to keep the metadata, or to save it and then restore it?
I found it.
The line:
myCorpus <- tm_map(myCorpus, PlainTextDocument)
solves the problem but erases the metadata.
I found this answer, which explains a better way to use tm_map(). I just have to substitute:
with:
And it all works!
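A minimal sketch of one way to lowercase the corpus through tm_map() while keeping the metadata. It assumes the better way meant here is tm's content_transformer() wrapper, which applies a plain character function to each document's content and leaves the document object (and its metadata) in place. VCorpus is used instead of Corpus only to make the corpus class explicit; that choice is an assumption and not part of the original code.

```r
library(tm)

bcorp <- c("one", "two", "three", "four", "five")

# VCorpus (assumption: used here so each element is a PlainTextDocument)
myCorpus <- VCorpus(VectorSource(bcorp),
                    readerControl = list(language = "en_US"))

# Wrap tolower so tm_map() gets back a TextDocument, not a bare character vector
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

tdm <- TermDocumentMatrix(myCorpus)

# Document names should still be the original IDs
Docs(tdm)

# Per-document metadata is still attached
meta(myCorpus[[1]])
```

The same wrapper should work for other plain string functions (for example a custom function built on gsub), since tm_map() only needs the result to remain a text document.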