Say I had this string:
hellohowareyou
Is there a fast way to separate this into individual words, so the end result is "hello how are you"? I can think of several ways, but they would be EXTREMELY slow (first I need to check each letter against a dictionary, see which letters compose words, and there would probably be multiple combinations, then I need to decide on the most likely combination, etc.).
Here's some code that does a recursive brute-force search. It puts the word list into a set, so the lookups are quite fast: the examples below run in less than 1 second on my old 2 GHz machine with 2GB of RAM. However, splitting longer sequences than the examples I've used will certainly take longer, mostly because there are so many possible combinations. To weed out the meaningless results you will either need to do that manually, or use software that can do natural language processing.
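A recursive brute-force search over a word set might look like this minimal sketch (the hard-coded `WORDS` set here is a tiny stand-in for a real word list, not the original answer's dictionary):

```python
# Hypothetical recursive brute-force splitter; WORDS is a small stand-in
# for a real dictionary file loaded into a set for fast lookups.
WORDS = {"hello", "how", "are", "you", "hell", "oh", "a", "re"}

def split_all(text, words=WORDS):
    """Return every way to split `text` into dictionary words."""
    if not text:
        return [[]]  # one valid split: the empty one
    results = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in words:  # set membership test is O(1) on average
            for rest in split_all(text[i:], words):
                results.append([prefix] + rest)
    return results

print(split_all("hellohowareyou"))
```

With a full dictionary the number of results grows quickly, which is why longer inputs take much longer and produce many meaningless splits.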
This code was written for Python 3, but it can be made to run on Python 2 with a minor change.
BTW, we can sort the output lists by length, i.e., the number of words in each list. This tends to put the more sensible results near the top.
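For instance, assuming the splitter returns a list of word lists, sorting by length is a one-liner:

```python
# Hypothetical splitter output; sorting by list length puts splits with
# fewer words (often the more sensible ones) first.
results = [
    ["hello", "how", "a", "re", "you"],
    ["hello", "how", "are", "you"],
]
results.sort(key=len)
print(results[0])  # ['hello', 'how', 'are', 'you']
```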
It's a "hard" problem, in that you need to employ some heuristics with more than just a dictionary behind them. You can turn a dictionary into a letter tree (a trie), so it can be efficiently searched letter by letter from the presumed start of a word, but the harder part is deciding what to do when you run into a string of letters that's not in the dictionary: things like "ABS" (a plastic), "invac" (a bank employee's possible shorthand for "investment account"), or "ncie" (a typo for "nice").
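A minimal sketch of such a letter tree, as one possible illustration (not from the original answer; each node is a dict of child letters with a sentinel key marking the end of a word):

```python
END = "$"  # sentinel key marking the end of a complete word

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def prefixes_in(trie, text):
    """Yield every dictionary word that starts at the beginning of `text`."""
    node = trie
    for i, ch in enumerate(text):
        if ch not in node:
            return  # no dictionary word continues from here
        node = node[ch]
        if END in node:
            yield text[: i + 1]

trie = build_trie(["the", "there", "a", "rap", "therapist"])
print(list(prefixes_in(trie, "therapists")))  # ['the', 'therapist']
```

Walking the trie this way finds all candidate word boundaries at a position in a single left-to-right pass, instead of testing each prefix against the dictionary separately.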
Oh, and there are also intrinsic ambiguities where a missing space makes a big difference to what follows. Consider "therapists" ... you need to be a human (or almost) to analyze the following context to work out if a space is needed after "the", or not.
Thoughts:
Taking the sentence:
IsthereafastwaytoseparatethisintoindividualwordssotheendresultishellohowareyouIcanthinkofseveralwaysbuttheywouldbeEXTREMELYslowfirstIneedtoidentifyeachletteragainstadictionaryseewhichletterscomposeawords...
A person would be able to split this up quite well into a sentence that makes sense. Therefore, a machine should be able to do the same.
Taking:
"Is the reafastwaytoseparate..." should be "Is there a fast way to separate...". Note that no matter how many letters are taken after "reafast", they will never make a word.
Therefore, a possible correct approach would be to run through the sentence taking the shortest possible word at each point, and lengthening the current word whenever the letters that follow cannot form a word. This can be approximated by looking roughly 15 letters ahead before lengthening the original word.
In rare cases, you may need to backtrack past the previous word or, in extremely rare cases, two or three words back. Additionally, 15 letters could be too few for longer words.
Finally, if there are proper nouns or words from other languages, they will not be in the dictionary. Therefore, when a word cannot be found, the next viable word can become the new starting point, and the unmatched fragment can be flagged or ignored. In a learning model, in a case like this, it should be added to the corpus of words.
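The shortest-word-first idea with backtracking might be sketched like this (my own interpretation of the steps above; the word set is a tiny stand-in for a real dictionary, and the 15-letter look-ahead cap and unknown-word skipping are not implemented):

```python
# Sketch of a greedy shortest-word splitter with backtracking: take the
# shortest dictionary word next, and fall back to a longer previous word
# when the remainder cannot be split.
WORDS = {"is", "the", "there", "a", "fast", "way", "to", "separate"}

def greedy_split(text, words=WORDS):
    def helper(rest):
        if not rest:
            return []
        # try candidate words shortest-first
        for i in range(1, len(rest) + 1):
            if rest[:i] in words:
                tail = helper(rest[i:])
                if tail is not None:
                    return [rest[:i]] + tail
        return None  # dead end: makes the caller try a longer word

    return helper(text)

print(greedy_split("isthereafastwaytoseparate"))
```

On this input the splitter first tries "is the", hits the dead end at "reafast...", and backtracks to "is there", exactly the situation described above.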
Splitting the words into parts of speech (verbs, nouns, etc.) could speed the process up, since, for example, an adjective is usually followed by a noun. But this may not be worth the effort, as another adjective could follow. In any case, all words in the corpus must be tested against, as this isn't built to check grammar.