可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I want to solve word splitting problem (parse words from long string with no spaces). For examle we want extract words from somelongword to [some, long, word].

We can achieve this by some dynamic approach with dictionary, but another issue we encounter is parsing ambiguity. I.e. orcore => or core or orc ore (We don't take into account phrase meaning or part of speech). So i think about usage of some statistical or ML approach.

I found that Naive Bayes and Viterbi algorithm with train set can be used for solving this. Can you point me some information about application of these algorithms to word splitting problem?

UPD: I've implemented this method on Clojure, using some advices from Peter Norvig's code

回答1:

I think that slideshow by Peter Norvig and Sebastian Thurn is a good point to start. It presents real-world work made by google.

回答2:

This problem is entirely analagous to word segmentation in many Asian languages that don't explicitly encode word boundaries (e.g. Chinese, Thai). If you want background on approaches to the problem, I'd recommend you look at Google Scholar for current Chinese Word Segmentation approaches.

You might start by looking at some older approaches: Sproat, Richard and Thomas Emerson. 2003. The first international Chinese word segmentation bakeoff (http://www.sighan.org/bakeoff2003/paper.pdf)

If you want a ready-made solution, I'd recommend LingPipe's tutorial (http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html). I've used it on unsegmented English text with good results. I trained the underlying character language model on a couple million words of newswire text, but I suspect that for this task you'll get reasonable performance using any corpus of relatively normal English text.

They used a spelling-correction system to recommend candidate 'corrections' (where the candidate corrections are identical to the input but with spaces inserted). Their spelling corrector is based on Levenshtein edit distance; they just disallow substitution and transposition, and restrict allowable insertions to only a single space.