Splitting chinese document into sentences [closed]

I have to split Chinese text into multiple sentences. I tried the Stanford DocumentPreProcessor. It worked quite well for English but not for Chinese.

Please can you let me know any good sentence splitters for Chinese preferably in Java or Python.

标签： nlp tokenize stanford-nlp sentence

3条回答

淡お忘

2楼-- · 2019-04-09 02:27

Using some regex tricks in Python (c.f. a modified regex of Section 2.3 of http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf):

import re

paragraph = u'\u70ed\u5e26\u98ce\u66b4\u5c1a\u5854\u5c14\u662f2001\u5e74\u5927\u897f\u6d0b\u98d3\u98ce\u5b63\u7684\u4e00\u573a\u57288\u6708\u7a7f\u8d8a\u4e86\u52a0\u52d2\u6bd4\u6d77\u7684\u5317\u5927\u897f\u6d0b\u70ed\u5e26\u6c14\u65cb\u3002\u5c1a\u5854\u5c14\u4e8e8\u670814\u65e5\u7531\u70ed\u5e26\u5927\u897f\u6d0b\u7684\u4e00\u80a1\u4e1c\u98ce\u6ce2\u53d1\u5c55\u800c\u6210\uff0c\u5176\u5b58\u5728\u7684\u5927\u90e8\u5206\u65f6\u95f4\u91cc\u90fd\u5728\u5feb\u901f\u5411\u897f\u79fb\u52a8\uff0c\u9000\u5316\u6210\u4e1c\u98ce\u6ce2\u540e\u7a7f\u8d8a\u4e86\u5411\u98ce\u7fa4\u5c9b\u3002'

def zng(paragraph):
    for sent in re.findall(u'[^!?。\.\!\?]+[!?。\.\!\?]?', paragraph, flags=re.U):
        yield sent

list(zng(paragraph))

Regex explanation: https://regex101.com/r/eNFdqM/2

0人赞添加讨论(0) 举报

Anthone

3楼-- · 2019-04-09 02:44

For unsegmented text, using the Stanford libraries, you probably want to use their Chinese CoreNLP. This isn't as well documented as the base corenlp, but it will work for your task.

http://nlp.stanford.edu/software/corenlp-faq.shtml#languages http://nlp.stanford.edu/software/corenlp.shtml

You will want the segmenter and the sentence splitter. "segment, ssplit" The others are not relevant.

Alternatively, you can use the WordToSentenceSplitter class in edu.stanford.nlp.process.WordToSentenceSplitter directly. If you do that, you can look at how it is used in WordsToSentencesAnnotator.

0人赞添加讨论(0) 举报

Root（大扎）

4楼-- · 2019-04-09 02:50

Either of these open sources projects should be useful afaik:

HanLP https://github.com/hankcs/HanLP
FudanNLP https://github.com/FudanNLP/fnlp

0人赞添加讨论(0) 举报

Splitting chinese document into sentences [closed]

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间