Input: "tableapplechairtablecupboard..."
many words
What would be an efficient algorithm to split such text into a list of words and get:
Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]
The first thing that comes to mind is to go through all possible words (starting with the first letter), find the longest word possible, and then continue from position = word_position + len(word).
P.S.
We have a list of all possible words.
Word "cupboard" can be "cup" and "board", select longest.
Language: python, but main thing is the algorithm itself.
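For illustration, a minimal sketch of that greedy longest-match idea (the function and the sample word set below are mine, not from the question):

```python
def greedy_split(text, words):
    """Greedily take the longest dictionary word at each position.
    Note: a greedy match can fail where a shorter word would have
    allowed the rest of the string to be split."""
    result, pos = [], 0
    while pos < len(text):
        # try the longest remaining prefix first
        for end in range(len(text), pos, -1):
            if text[pos:end] in words:
                result.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError("no dictionary word starts at position %d" % pos)
    return result

words = {"table", "apple", "chair", "cup", "board", "cupboard"}
print(greedy_split("tableapplechairtablecupboard", words))
# ['table', 'apple', 'chair', 'table', 'cupboard']
```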
Here is a solution using recursive search:
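A sketch of one way to write such a recursive search (it tries the longest matching prefix first and backtracks if the remainder cannot be split; the names and the None convention are illustrative):

```python
def find_words(s, words):
    """Return a list of dictionary words covering s, preferring the
    longest prefix at each step, or None if s cannot be segmented."""
    if not s:
        return []
    for end in range(len(s), 0, -1):          # longest prefix first
        prefix = s[:end]
        if prefix in words:
            rest = find_words(s[end:], words)
            if rest is not None:              # backtrack if the rest fails
                return [prefix] + rest
    return None

words = {"table", "apple", "chair", "cup", "board", "cupboard"}
print(find_words("cupboard", words))
# ['cupboard']  (the longest word is selected)
```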
Based on the excellent work in the top answer, I've created a pip package for easy use. To install, run pip install wordninja.

The only differences are minor. It returns a list rather than a str, it works in python3, it includes the word list, and it properly splits even if there are non-alpha chars (like underscores, dashes, etc.).

Thanks again to Generic Human!
https://github.com/keredson/wordninja
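A minimal usage sketch (the exact split depends on the word list that ships with the package):

```python
import wordninja

print(wordninja.split("tableapplechairtablecupboard"))
# e.g. ['table', 'apple', 'chair', 'table', 'cupboard']
```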
Unutbu's solution was quite close, but I find the code difficult to read and it didn't yield the expected result. Generic Human's solution has the drawback that it needs word frequencies, which is not appropriate for all use cases.
Here's a simple solution using a Divide and Conquer algorithm.
find_words('cupboard') will return ['cupboard'] rather than ['cup', 'board'] (assuming that cupboard, cup and board are in the dictionary).

find_words('charactersin') could return ['characters', 'in'] or it might return ['character', 'sin'] (as seen below). You could quite easily modify the algorithm to return all optimal solutions.

The code:
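A sketch of a divide-and-conquer find_words in that spirit (memoised; returning None for an unsplittable string is a choice of this sketch, not necessarily the original behaviour):

```python
def find_words(s, words, memo=None):
    """Divide and conquer: if s is itself a dictionary word, that is the
    best split; otherwise try every split point, solve both halves
    recursively and keep the split with the fewest words overall."""
    if memo is None:
        memo = {}
    if s in memo:
        return memo[s]
    if s in words:
        memo[s] = [s]          # a single word always beats any split of it
        return [s]
    best = None
    for i in range(1, len(s)):
        left = find_words(s[:i], words, memo)
        right = find_words(s[i:], words, memo)
        if left is not None and right is not None:
            candidate = left + right
            if best is None or len(candidate) < len(best):
                best = candidate
    memo[s] = best             # None means s cannot be segmented
    return best
```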
This will take about 5 seconds on my 3 GHz machine:
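For example, a rough way to time it against a system word list (the file path is an assumption; any newline-separated word list will do):

```python
import time

# assumption: a newline-separated word list such as /usr/share/dict/words
with open("/usr/share/dict/words") as f:
    dictionary = set(word.strip().lower() for word in f)

start = time.perf_counter()
print(find_words("tableapplechairtablecupboard", dictionary))
print("%.2f seconds" % (time.perf_counter() - start))
```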
Expanding on @miku's suggestion to use a Trie, an append-only Trie is relatively straightforward to implement in Python. We can then build a Trie-based dictionary from a set of words, which produces a tree in which * indicates the beginning or end of a word. We can incorporate this into a solution by combining it with a heuristic about how to choose words. For example, we can prefer longer words over shorter words:
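A sketch of such an append-only Trie plus a longest-word-first segmenter on top of it (the class and function names are of my choosing):

```python
class Trie:
    """Append-only trie: nodes are plain dicts, '*' marks the end of a word."""
    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for char in word:
            node = node.setdefault(char, {})
        node['*'] = True                      # end-of-word marker

def build_trie(words):
    trie = Trie()
    for word in words:
        trie.add(word.lower())
    return trie

def segment(text, trie):
    """Split text into dictionary words, preferring the longest word that
    starts at the current position and backtracking if that choice leaves
    an unsplittable remainder. Returns None if no segmentation exists."""
    if not text:
        return []
    # walk the trie once, recording every prefix of text that is a word
    matches, node = [], trie.root
    for i, char in enumerate(text):
        if char not in node:                  # short-circuit: no longer prefix exists
            break
        node = node[char]
        if '*' in node:
            matches.append(i + 1)             # text[:i+1] is a word
    for end in reversed(matches):             # heuristic: longest word first
        rest = segment(text[end:], trie)
        if rest is not None:
            return [text[:end]] + rest
    return None
```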
We can use this function like this:
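For example (the small word set here is just for illustration):

```python
trie = build_trie({"table", "apple", "chair", "cup", "board", "cupboard",
                   "pea", "peanut", "nut"})

print(segment("tableapplechairtablecupboard", trie))
# ['table', 'apple', 'chair', 'table', 'cupboard']
print(segment("peanut", trie))
# ['peanut']  (not ['pea', 'nut'], because longer words are preferred)
```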
Because we maintain our position in the Trie as we search for longer and longer words, we traverse the trie at most once per possible solution (rather than twice for peanut: pea, peanut). The final short-circuit saves us from walking char-wise through the string in the worst case. The final result is only a handful of inspections.
A benefit of this solution is that you know very quickly whether longer words exist with a given prefix, which spares the need to exhaustively test sequence combinations against a dictionary. It also makes arriving at an unsolvable answer comparatively cheap compared to other implementations. The downsides of this solution are a large memory footprint for the trie and the cost of building the trie up front.

For the German language there is CharSplit, which uses machine learning and works pretty well for strings made up of a few words.
https://github.com/dtuggener/CharSplit
The answer by https://stackoverflow.com/users/1515832/generic-human is great. But the best implementation of this I've ever seen was written by Peter Norvig himself in his book 'Beautiful Data'.
Before I paste his code, let me expand on why Norvig's method is more accurate (although a little slower and longer in terms of code).
1) The data is a bit better, both in terms of size and in terms of precision (he uses a word count rather than a simple ranking).
2) More importantly, it's the logic behind n-grams that really makes the approach so accurate.
The example he provides in his book is the problem of splitting the string 'sitdown'. A non-bigram method of string splitting would consider p('sit') * p('down'), and if this is less than p('sitdown') - which will quite often be the case - it will NOT split it, but we'd want it to (most of the time).
However, when you have the bigram model, you can weigh p('sit down') as a bigram against p('sitdown'), and the former wins. Basically, if you don't use bigrams, the probabilities of the words you're splitting are treated as independent, which is not the case: some words are more likely to appear one after the other. Unfortunately, those are also the words that are often stuck together and confuse the splitter.
Here's the link to the data (it's data for 3 separate problems, and segmentation is only one of them; please read the chapter for details): http://norvig.com/ngrams/
and here's the link to the code: http://norvig.com/ngrams/ngrams.py
These links have been up a while, but I'll copy-paste the segmentation part of the code here anyway.
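In that spirit, a simplified unigram sketch of the segmenter (the word-count file name is an assumption, modelled on Norvig's count_1w.txt; the bigram refinement described above is omitted, and this is not his code verbatim):

```python
from functools import lru_cache

# assumption: a tab-separated "word<TAB>count" file like Norvig's count_1w.txt
COUNTS = {}
with open("count_1w.txt") as f:
    for line in f:
        word, count = line.split("\t")
        COUNTS[word] = int(count)
TOTAL = float(sum(COUNTS.values()))

def pword(word):
    """Unigram probability, with a crude penalty for unseen words."""
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    return 10.0 / (TOTAL * 10 ** len(word))

def pwords(words):
    """Naive-Bayes probability of a sequence of words."""
    p = 1.0
    for w in words:
        p *= pword(w)
    return p

@lru_cache(maxsize=None)
def segment(text, max_word_len=20):
    """Return the most probable segmentation of text as a tuple of words."""
    if not text:
        return ()
    candidates = ((text[:i],) + segment(text[i:])
                  for i in range(1, min(len(text), max_word_len) + 1))
    return max(candidates, key=pwords)

print(segment("tableapplechairtablecupboard"))
```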