R break corpus into sentences (answers, page 2)

I have a number of PDF documents, which I have read into a corpus with the tm library. How can I break the corpus into sentences?
It can be done by reading each file with readLines and then applying sentSplit from the qdap package [*]. That function requires a data frame, however, so I would have to abandon the corpus and read all the files individually.
How can I apply sentSplit {qdap} over a corpus in tm? Or is there a better way?

Note: there was a function sentDetect in library openNLP, which is now Maxent_Sent_Token_Annotator - the same question applies: how can this be combined with a corpus [tm]?

Tags: r split tm sentence qdap

7 answers

你好瞎i

Reply #2 · 2019-01-22 16:03

With qdap version 1.1.0 you can accomplish this with the following (I used @Tony Breyal's current.corpus dataset):

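library(qdap)
# Convert the corpus to a data frame, split its "text" column into sentences,
# then rebuild a corpus from the sentence-level rows (keyed by the tot column).
with(sentSplit(tm_corpus2df(current.corpus), "text"), df2tm_corpus(tot, text))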

You could also do:

tm_map(current.corpus, sent_detect)


## inspect(tm_map(current.corpus, sent_detect))

## A corpus with 3 text documents
## 
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
##   create_date creator 
## Available variables in the data frame are:
##   MetaID 
## 
## $doc1
## [1] Doctor Who is a British science fiction television programme produced by the BBC.                                                                     
## [2] The programme depicts the adventures of a Time Lord—a time travelling, humanoid alien known as the Doctor.                                            
## [3] He explores the universe in his TARDIS, a sentient time-travelling space ship.                                                                        
## [4] Its exterior appears as a blue British police box, a common sight in Britain in 1963, when the series first aired.                                    
## [5] Along with a succession of companions, the Doctor faces a variety of foes while working to save civilisations, help ordinary people, and right wrongs.
## 
## $doc2
## [1] The show has received recognition from critics and the public as one of the finest British television programmes, winning the 2006 British Academy Television Award for Best Drama Series and five consecutive awards at the National Television Awards during Russell T Davies's tenure as Executive Producer.
## [2] In 2011, Matt Smith became the first Doctor to be nominated for a BAFTA Television Award for Best Actor.                                                                                                                                                                                                       
## [3] In 2013, the Peabody Awards honoured Doctor Who with an Institutional Peabody for evolving with technology and the times like nothing else in the known television universe.                                                                                                                                   
## 
## $doc3
## [1] The programme is listed in Guinness World Records as the longest-running science fiction television show in the world and as the most successful science fiction series of all time—based on its over-all broadcast ratings, DVD and book sales, and iTunes traffic.
## [2] During its original run, it was recognised for its imaginative stor
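
If qdap is not available, the same idea can be approximated in base R. This is only a rough sketch, not what sentSplit or sent_detect do internally: the named list `docs` is a hypothetical stand-in for a tm corpus, the sample text is shortened for illustration, and `split_sentences` is a made-up helper that uses a naive regular expression.

```r
# Hypothetical stand-in for a tm corpus: a named list of document texts.
docs <- list(
  doc1 = "Doctor Who is a British programme. It is produced by the BBC.",
  doc2 = "In 2011, Matt Smith was nominated. Was that a first? Yes!"
)

# Naive splitter (an assumption, not qdap's algorithm): break after ., ! or ?
# when followed by whitespace. It will wrongly split after abbreviations
# such as "Dr." or "e.g.".
split_sentences <- function(text) {
  unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
}

# Apply the splitter to every document, keeping the document names.
sentences <- lapply(docs, split_sentences)

# One row per sentence, keeping track of the source document.
sentence_df <- data.frame(
  doc = rep(names(sentences), lengths(sentences)),
  sentence = unlist(sentences, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(sentence_df)
```

For real text with abbreviations, titles, or decimal numbers, a trained sentence detector such as the openNLP annotator mentioned in the question is likely to do better than this regular expression.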
