I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like 主楼怎么走 (with spaces it would be 主楼 怎么 走).
At the moment I can think of one solution. I have a dictionary of Chinese words (in a database). The script will:

1. Try to find the first two characters of the sentence in the database (主楼).
2. If 主楼 is actually a word and it's in the database, try to find the first three characters (主楼怎).
3. 主楼怎 is not a word, so it's not in the database, which means my application now knows that 主楼 is a separate word.
4. Repeat the same steps with the rest of the characters.
I don't really like this approach, because analyzing even a small text would mean querying the database far too many times.

Are there any other solutions to this?
Another one that works well is http://www.itgrass.com/phpanalysis/index.html
It's the only one I found that works properly with UTF-8. The rest only worked for me in GB18030, which caused tons of issues later on down the line. I thought I was going to have to start over, but this one saved me a lot of time.
To improve the performance of this, can't you do all these checks before you insert the sentence into the database, and add spaces yourself?
You might want to consider using a trie data structure. You first construct the trie from the dictionary; then searching for valid words will be much faster. The advantage is that determining whether you are at the end of a word, or need to continue looking for longer words, is very fast.
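A minimal sketch of that idea (the class and function names are just illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if the path from the root spells a word

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def longest_match(trie, text, start):
    """Length of the longest dictionary word beginning at text[start:], or 0."""
    node, best = trie, 0
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break            # no longer word can start here
        if node.is_word:
            best = i - start + 1
    return best

trie = build_trie(["主楼", "怎么", "走"])
print(longest_match(trie, "主楼怎么走", 0))  # 2, i.e. 主楼
```

Each lookup walks a single path from the root, so a whole sentence can be segmented in one pass with no database round trips.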
Well, if you have a database with all the words and there is no other way to get at those words, I think you are forced to re-query the database.

You have the input text, sentence, paragraph, whatever. So yes, your processing of it will need to query against your DB for each check.
With decent indexing on the word column though, you shouldn't have too many problems.
Having said that, how big is this dictionary? After all, you would only need the words, not their definitions, to check whether something is a valid word. So if at all possible (depending on the size), keeping a huge memory map/hashtable/dictionary with just the keys (the actual words) may be an option and would be quick as lightning.

At 15 million words, with say an average of 7 characters at 2 bytes each, that works out around the 200-megabyte mark. Not too crazy.

Edit: At 'only' 1 million words, you're looking at just over 13 megabytes; say 15 with some overhead. That's a no-brainer, I would say.
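For illustration, a sketch of that lookup table, assuming the word column has been dumped to a plain text file with one word per line (the file name is hypothetical):

```python
# Load just the words once at startup; every later check is an O(1)
# set lookup instead of a database round trip.
with open("words.txt", encoding="utf-8") as f:
    dictionary = {line.strip() for line in f if line.strip()}

print("主楼" in dictionary)
```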
You can build a very, very long regular expression.

Edit: I meant to build it automatically with a script from the DB, not to write it by hand.
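Something along these lines, say in Python (the hard-coded word list just stands in for a dump of the word column):

```python
import re

words = ["主楼", "怎么", "走"]

# Put longer words first so the alternation prefers the longest match.
pattern = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))

print(re.findall(pattern, "主楼怎么走"))
# ['主楼', '怎么', '走']  (characters matching no word are silently skipped)
```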