Split a sentence into separate words

Posted 2019-02-01 11:45

I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走).

At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will:

  1. try to find the first two characters of the sentence in the database (主楼),

  2. if 主楼 is actually a word and it's in the database, the script will try to find the first three characters (主楼怎). 主楼怎 is not a word, so it's not in the database => my application now knows that 主楼 is a separate word.

  3. repeat this with the rest of the characters.

I don't really like this approach, because analyzing even a small text would mean querying the database far too many times.
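To make this concrete, here is a rough Python sketch of what I mean, with a plain set standing in for the database (word_is_in_db would be a real query in my script; all the names here are just for illustration):

```python
WORDS = {"主楼", "怎么", "走"}  # stand-in for the dictionary table

def word_is_in_db(candidate):
    # in the real script this would be a database lookup
    return candidate in WORDS

def segment(sentence):
    tokens = []
    i = 0
    while i < len(sentence):
        longest = sentence[i]        # fall back to a single character
        j = i + 2
        while j <= len(sentence) and word_is_in_db(sentence[i:j]):
            longest = sentence[i:j]  # a longer word matched; keep extending
            j += 1
        tokens.append(longest)
        i += len(longest)
    return tokens

print(segment("主楼怎么走"))  # ['主楼', '怎么', '走']
```

Every pass through the inner loop is one more database query, which is exactly what worries me.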

Are there any other solutions to this?

11 Answers
爱情/是我丢掉的垃圾
#2 · 2019-02-01 12:24

Another one that works well is phpanalysis: http://www.itgrass.com/phpanalysis/index.html

It's the only one I found that works properly with UTF-8. The rest only worked for me in GB18030, which caused tons of issues later on down the line. I thought I was going to have to start over, but this one saved me a lot of time.

再贱就再见
#3 · 2019-02-01 12:25

To improve the performance of this, can't you do all these checks before you insert the sentence into the database, and add spaces yourself?

爱情/是我丢掉的垃圾
#4 · 2019-02-01 12:30

You might want to consider using a trie data structure. You first construct the trie from the dictionary; then searching for valid words will be much faster. The advantage is that determining whether you are at the end of a word, or need to keep looking for a longer word, is very fast.
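A rough sketch of the idea in Python, using a plain dict-of-dicts as the trie (the class and function names here are just for illustration):

```python
# Build a trie from the dictionary once; segmentation then walks the trie
# character by character instead of querying a database per candidate.

class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # sentinel: a dictionary word ends at this node

    def longest_prefix(self, text, start):
        """Length of the longest dictionary word beginning at text[start]."""
        node, best = self.root, 0
        for offset, ch in enumerate(text[start:], 1):
            node = node.get(ch)
            if node is None:
                break             # no longer word can start here
            if "$" in node:
                best = offset     # found a word; keep looking for a longer one
        return best

trie = Trie()
for w in ("主楼", "怎么", "走"):
    trie.insert(w)

def segment(text):
    tokens, i = [], 0
    while i < len(text):
        n = trie.longest_prefix(text, i) or 1  # unknown character: emit alone
        tokens.append(text[i:i + n])
        i += n
    return tokens

print(segment("主楼怎么走"))  # ['主楼', '怎么', '走']
```

The sentinel makes "is this a word?" and "could a longer word continue from here?" both O(1) checks per character.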

在下西门庆
#5 · 2019-02-01 12:31

Well, if you have a database with all the words and there is no other way to get those words into your application, I think you are forced to re-query the database.

Anthone
#6 · 2019-02-01 12:34

You have the input text: a sentence, a paragraph, whatever. So yes, processing it will need to query your DB for each check.

With decent indexing on the word column though, you shouldn't have too many problems.
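For example, with SQLite it might look something like this (the file, table, and column names are assumed):

```python
import sqlite3

conn = sqlite3.connect("dictionary.db")  # assumed file name
conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT)")
# the index turns each existence check into a fast B-tree lookup
conn.execute("CREATE INDEX IF NOT EXISTS idx_words_word ON words (word)")

def word_is_in_db(candidate):
    row = conn.execute("SELECT 1 FROM words WHERE word = ? LIMIT 1",
                       (candidate,)).fetchone()
    return row is not None
```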

Having said that, how big is this dictionary? After all, you only need the words, not their definitions, to check whether something is a valid word. So if at all possible (depending on the size), holding a huge memory map/hashtable/dictionary with just the keys (the actual words) may be an option, and it would be quick as lightning.

At 15 million words, with an average of 7 characters at 2 bytes each, that works out to around the 200 megabyte mark. Not too crazy.

Edit: At 'only' 1 million words, you're looking at just over 13 megabytes; say 15 with some overhead. That's a no-brainer, I would say.
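A sketch of the in-memory variant, assuming the words are available as a plain UTF-8 text file with one word per line:

```python
# Load the whole word list into a set once at startup; after that, every
# membership check is an O(1) hash lookup with no database round trip.

def load_words(path="words.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

words = load_words()
print("主楼" in words)
```

(A Python set adds per-entry overhead on top of the raw character data, but at 1 million words it still fits comfortably in memory.)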

女痞
#7 · 2019-02-01 12:34

You could build a very, very long regular expression.

Edit: I meant building it automatically with a script from the DB, not writing it by hand.
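Something along these lines, with a tiny hardcoded list standing in for the dump from the DB:

```python
import re

WORDS = ["主楼", "怎么", "走"]  # in practice: generated from the words table

# longest words first, so the alternation prefers the longest match
pattern = re.compile("|".join(re.escape(w)
                              for w in sorted(WORDS, key=len, reverse=True)))

print(pattern.findall("主楼怎么走"))  # ['主楼', '怎么', '走']
```

Note that findall silently skips characters that match no word, so unknown characters would need separate handling.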
