Given
a dictionary full of words
{in, july, den, dentist, best, ...}
with some C++ API to access it:boolean findWord(string word)
, orstring getNextWord(void)
to iterate through it,some input string with no space, e.g.:
bestdentistinjuly
...
Output
best dentist in july is...
(basically separate the non-space string by space given a dictionary)
What will be the best algorithm to solve it?
A subtle but important question is, is there any fancy way to solve the unreachable dead-end problem. E.g., den
and dentist
are both valid words to dissect the rest of the string, one of them may just be a dead-end.
To me it seems like a greedy problem or something solvable by dynamic programming..
This problem is basically like matching a regex of the form:
So any regex algorithm may be used. Which you should choose depends on what will be the size of the dictionary and the length of the input. You should probably start here: http://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times
You can create a kind of word tree:
You can go throught the string with no space. Once you find a word in your list, you add a space and you continu... until you cannot go further.
Then you go back to the previous word and try to se if adding new letter you can create a word, and if you can continu from their.
You try this until you tried all the possiblities.
If you go back to the starting word and you don't find any solution => no sol
Here is the algorithm ( my pseudocode syntax is not good, but you can get the general idea. I believe you will have to correct it a little):
Algorithm ends when you find no sol, or when your at a "leaf" (you went thhrough the whole string)
Here is an illustration using your example:
Hope my explaination is clear.
Use a Trie to store the dictionary. You can see a simple implementation (C#) at How to create a trie in c#
You're going to need to do a search because you don't know if you are on the right track until you have considered the whole input string. You'll need to iterate through the input string, at the same time as descending into the trie. When you get to a terminal node of the trie, you have a branch in your search process: you need to both treat that as the end of a word and treat it as the first part of a longer word.
Maybe there are more than one valid solutions to separate the input string. You could use a backtracking algorithm to just find one or all valid solutions. On positions where two or more words e.g. "den", "dentist" are possible, the algorithm should try the longer words first.
The dictionary of course should be stored into a Trie to quickly find the matching words.
In the following ascii image the left branch would be examined first in a Depth-first search which prefers longer words. A solution would be found before the algorithm looks at the right branch with the word "den".
It is unclear from original post on what exactly to solve. I am guessing that @Figo is looking for something similar to a string matching algorithm.
A very good resource for this: http://igm.univ-mlv.fr/~lecroq/string/
I think you could make it faster if you could have the
findWord
method return different values for 'not-a-word' and 'no-words-starting-with-this-prefix'. This would be easy if the dictionary was stored as a trie.The reason is that if you are checking words as in @Ricky Bobby's answer, after you find 'best', you still need to check 'bestd' and 'bestde' and so on all the way to the end of the string. However if the check for 'bestd' returns 'no-longer-words', then you have trimmed out a whole bunch of searching.