I have a number of PDF documents, which I have read into a corpus with library tm. How can one break the corpus into sentences?

It can be done by reading the file with readLines followed by sentSplit from package qdap [*]. That function requires a dataframe. It would also require abandoning the corpus and reading all files individually.

How can I pass function sentSplit {qdap} over a corpus in tm? Or is there a better way?

Note: there was a function sentDetect in library openNLP, which is now Maxent_Sent_Token_Annotator; the same question applies: how can this be combined with a corpus [tm]?

With qdap version 1.1.0 you can accomplish this with the following (I used @Tony Breyal's current.corpus dataset):

You could also do:
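
As a rough illustration of the general idea (not the qdap code referred to above), here is a minimal sketch in R that splits each document of a tm corpus into sentences. The helper split_sentences and the sample docs vector are hypothetical names made up for this example, and a plain regex stands in for sentSplit or Maxent_Sent_Token_Annotator, so abbreviations such as "Dr." will be split incorrectly. The tm branch only runs if tm is installed.

```r
# Hypothetical helper: split one document's text into sentences with a
# simple regex (a stand-in for qdap::sentSplit or the openNLP annotator).
split_sentences <- function(text) {
  # A document may be stored as several lines; collapse them first.
  text <- paste(text, collapse = " ")
  # Split on whitespace that directly follows ".", "!" or "?".
  sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  # Trim surrounding whitespace and drop empty pieces.
  sentences <- trimws(sentences)
  sentences[nzchar(sentences)]
}

# Made-up sample text standing in for the PDF contents.
docs <- c("This is one sentence. Here is another! Is this a third?",
          "A single sentence.")

if (requireNamespace("tm", quietly = TRUE)) {
  # With tm available: build a corpus, then split each document in it.
  corpus <- tm::VCorpus(tm::VectorSource(docs))
  sentences_by_doc <- lapply(corpus, function(doc) split_sentences(as.character(doc)))
} else {
  # Without tm: apply the same helper to the raw character vector.
  sentences_by_doc <- lapply(docs, split_sentences)
}

print(sentences_by_doc)
```

The result is a list with one character vector of sentences per document. If a corpus with one document per sentence is wanted, the list could be flattened with unlist() and wrapped again with VCorpus(VectorSource(...)).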