I have a set of documents:
documents = c("She had toast for breakfast",
"The coffee this morning was excellent",
"For lunch let's all have pancakes",
"Later in the day, there will be more talks",
"The talks on the first day were great",
"The second day should have good presentations too")
In this set of documents, I would like to remove the stopwords. I have already removed punctuation and converted to lower case, using:
documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation
First I convert to a Corpus object:
documents <- Corpus(VectorSource(documents))
Then I try to remove the stopwords:
documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords
But this last line results in the following error:
THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.
This has already been asked here but an answer was not given. What does this error mean?
EDIT
Yes, I am using the tm package.
Here is the output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
When I run into tm
problems I often end up just editing the original text.
For removing words it's a little awkward, but you can paste together a regex from tm
's stopword list.
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '')
> documents
[1] " toast breakfast" " coffee morning excellent"
[3] " lunch lets pancakes" "later day will talks"
[5] " talks first day great" " second day good presentations "
Maybe try to use the tm_map
function to transform the document. It seems to work in my case.
> documents = c("She had toast for breakfast",
+ "The coffee this morning was excellent",
+ "For lunch let's all have pancakes",
+ "Later in the day, there will be more talks",
+ "The talks on the first day were great",
+ "The second day should have good presentations too")
> library(tm)
Loading required package: NLP
> documents <- Corpus(VectorSource(documents))
> documents = tm_map(documents, content_transformer(tolower))
> documents = tm_map(documents, removePunctuation)
> documents = tm_map(documents, removeWords, stopwords("english"))
> documents
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 6
This yields
> documents[[1]]$content
[1] " toast breakfast"
> documents[[2]]$content
[1] " coffee morning excellent"
> documents[[3]]$content
[1] " lunch lets pancakes"
> documents[[4]]$content
[1] "later day will talks"
> documents[[5]]$content
[1] " talks first day great"
> documents[[6]]$content
[1] " second day good presentations "
you can use quanteda package to remove stop words, but first make sure your words are tokens and then use the following:
library(quanteda)
x<- tokens_select(x,stopwords(), selection=)