I have a number of PDF documents, which I have read into a corpus with library `tm`. How can one break the corpus into sentences?

It can be done by reading the file with `readLines` followed by `sentSplit` from package `qdap` [*]. That function requires a dataframe. It would also require abandoning the corpus and reading all files individually.

How can I pass function `sentSplit` {`qdap`} over a corpus in `tm`? Or is there a better way?
Note: there was a function `sentDetect` in library `openNLP`, which is now `Maxent_Sent_Token_Annotator` - the same question applies: how can this be combined with a corpus [tm]?
This is a function built off this Python solution that allows some flexibility in that the lists of prefixes, suffixes, etc. can be modified to your specific text. It's definitely not perfect, but could be useful with the right text.
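As a rough illustration of that idea, here is a minimal base-R sketch. The function name `split_sentences`, the default abbreviation lists, and the `<prd>`/`<stop>` placeholders are all assumptions chosen for this example, not the original code. Periods that should not end a sentence are protected first, then the text is split at the remaining sentence-final punctuation.

```r
# Hypothetical helper: the prefix and suffix lists are meant to be edited
# to suit the text at hand.
split_sentences <- function(text,
                            prefixes = c("Mr", "Mrs", "Ms", "Dr", "Prof", "St"),
                            suffixes = c("Inc", "Ltd", "Jr", "Sr", "Co")) {
  # Protect the period after each listed abbreviation with a placeholder.
  abbrev <- paste(c(prefixes, suffixes), collapse = "|")
  text <- gsub(paste0("\\b(", abbrev, ")\\."), "\\1<prd>", text, perl = TRUE)
  # Protect periods inside numbers such as 3.50.
  text <- gsub("(?<=[0-9])\\.(?=[0-9])", "<prd>", text, perl = TRUE)
  # Mark sentence ends: . ! or ? followed by whitespace or end of text.
  text <- gsub("([.!?])(\\s+|$)", "\\1<stop>", text, perl = TRUE)
  # Restore the protected periods and split on the markers.
  text <- gsub("<prd>", ".", text, fixed = TRUE)
  sentences <- trimws(strsplit(text, "<stop>", fixed = TRUE)[[1]])
  sentences[nzchar(sentences)]
}

example <- split_sentences("Dr. Smith paid 3.50 dollars. Mr. Jones left! Did he?")
print(example)
```

A sentence that genuinely ends in a listed abbreviation (for example "... at Acme Inc.") will not be split, which is one of the ways this approach is not perfect.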
`openNLP` had some major changes. The bad news is it looks very different than it used to. The good news is that it's more flexible and the functionality you enjoyed before is still there; you just have to find it.

This will give you what you're after:
?Maxent_Sent_Token_Annotator
Just work through the example and you'll see the functionality you're looking for.
I don't know how to reshape a corpus but that would be a fantastic functionality to have.
I guess my approach would be something like this:
Using these packages
I would set up my text to sentences function as follows:
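A minimal sketch of such a function, assuming the packages in question are `NLP` and `openNLP` (which in turn need a working Java installation). The name `convert_text_to_sentences` and its arguments are illustrative. Calls are namespace-qualified so that another package's `annotate` cannot be picked up by accident.

```r
# Hypothetical helper: split one document's text into sentences with openNLP.
convert_text_to_sentences <- function(text, lang = "en") {
  if (!requireNamespace("NLP", quietly = TRUE) ||
      !requireNamespace("openNLP", quietly = TRUE)) {
    stop("This sketch needs the NLP and openNLP packages.")
  }
  # Collapse multi-line input into a single NLP String object.
  text <- NLP::as.String(paste(text, collapse = " "))
  # Build the sentence annotator and compute sentence boundaries.
  annotator <- openNLP::Maxent_Sent_Token_Annotator(language = lang)
  boundaries <- NLP::annotate(text, annotator)
  # Subsetting a String by an Annotation extracts the annotated spans.
  as.character(text[boundaries])
}
```

Called as `convert_text_to_sentences("First sentence. Second sentence.")`, it should return one element per detected sentence.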
And my hack of a reshape corpus function (NB: you will lose the meta attributes here unless you modify this function somehow and copy them over appropriately)
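One way such a reshape could look, as a sketch only: it assumes a `tm` `VCorpus` as input, and the name `reshape_corpus` and its arguments are illustrative. `FUN` is any function that takes one string and returns a character vector of sentences. The result is a fresh corpus, so document metadata is not carried over.

```r
# Hypothetical helper: rebuild a corpus with one document per sentence.
reshape_corpus <- function(corpus, FUN, ...) {
  if (!requireNamespace("tm", quietly = TRUE)) {
    stop("This sketch needs the tm package.")
  }
  # Pull the text of each document out as a single string.
  texts <- lapply(corpus, function(doc) paste(NLP::content(doc), collapse = " "))
  # Split every document and flatten the results into one character vector.
  sentences <- unlist(lapply(texts, FUN, ...), use.names = FALSE)
  # Build a new corpus in which each sentence is its own document.
  tm::VCorpus(tm::VectorSource(sentences))
}
```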
Which works as follows:
My sessionInfo output
I implemented the following code to solve the same problem using the `tokenizers` package.

The error is probably caused by the ggplot2 package, whose `annotate` function masks the one from NLP. Detach ggplot2 and try again; it should then work.
Just convert your corpus into a dataframe and use regular expressions to detect the sentences.
Here is a function that uses regular expressions to detect sentences in a paragraph and returns each individual sentence.
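A minimal sketch of what such a function might look like in base R. The name `detect_sentences` and the splitting rule are assumptions for illustration: a sentence is taken to end at `.`, `!` or `?` followed by whitespace and an uppercase letter, so abbreviations such as "Dr. Smith" will be split incorrectly.

```r
# Hypothetical helper: split a paragraph into sentences with one regex.
detect_sentences <- function(paragraph) {
  # Treat a multi-line paragraph as one string.
  paragraph <- paste(paragraph, collapse = " ")
  # Split on whitespace that follows . ! or ? and precedes a capital letter.
  pieces <- strsplit(paragraph, "(?<=[.!?])\\s+(?=[A-Z])", perl = TRUE)[[1]]
  pieces[nzchar(pieces)]
}

sentences <- detect_sentences("It rained all day. We stayed in! Was that wise? probably not.")
print(sentences)
```

Because the last fragment starts with a lowercase letter, it stays attached to the question before it.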
Using one paragraph inside a corpus from the tm package:
Use as follows:
Which gives us:
Now with a larger corpus
Use as follows:
Which gives us:
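For the multi-document case, the data frame route could be sketched as follows. The `docs` data frame is made-up stand-in data for a corpus that has already been converted, and the column names `doc_id` and `text` as well as the function name `sentences_by_doc` are assumptions for this example. The result has one row per sentence and keeps track of the source document.

```r
# Stand-in for a corpus converted to a data frame (hypothetical data).
docs <- data.frame(
  doc_id = c("a.pdf", "b.pdf"),
  text = c("First sentence here. Second one follows.", "Only one sentence."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per sentence, keyed by document.
sentences_by_doc <- function(df) {
  # Split each document's text at sentence-final punctuation.
  parts <- strsplit(df$text, "(?<=[.!?])\\s+(?=[A-Z])", perl = TRUE)
  n <- lengths(parts)
  data.frame(
    doc_id = rep(df$doc_id, n),      # repeat each id once per sentence
    sentence_id = sequence(n),       # running number within each document
    sentence = unlist(parts, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

result <- sentences_by_doc(docs)
print(result)
```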