I have a lot of compound strings that are a combination of two or three English words.
e.g. "Spicejet" is a combination of the words "spice" and "jet"
I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.
What would be the most efficient way to separate the individual English words from such compound strings?
I would use the following algorithm.
1. Start with a sorted list of the words to split and a sorted list of declined words (the dictionary).
2. Create a result list of objects, each of which stores the remaining part of a word and the list of words matched so far.
3. Fill the result list with the words to split as the remaining parts.
4. Walk through the result list and the dictionary concurrently, always advancing whichever of the two pointers is at the lesser entry, in a manner similar to the merge algorithm. This way you can compare all the possible matching pairs in one pass.
5. Any time you find a match, i.e. a word to split that starts with a dictionary word, store the matched dictionary word and the remaining part in the result list. You have to allow for a word matching more than one dictionary word.
6. Any time the remaining part is empty, you have found a final result.
7. Any time you don't find a match on the "left side", in other words, every time you advance the result pointer because of no match, delete the corresponding result item. That word has no matches and can't be split.
8. Once you get to the bottom of the lists, you will have a list of partial results. Repeat the loop until this list is empty: go back to step 4.
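A minimal Python sketch of the rounds described above, under some simplifying assumptions: it keeps the result list of (remaining part, matched words) entries, but it looks up each prefix in a set instead of doing the merge-style walk over two sorted lists, and it assumes the input is already lowercased. The function name `split_compounds` is made up for illustration.

```python
def split_compounds(words, dictionary):
    """Return {word: [splits]} for every word that can be split completely."""
    vocab = set(dictionary)  # assumption: set lookup replaces the sorted merge walk
    finished = {}
    # Each partial result: (original word, remaining part, words matched so far).
    partial = [(w, w, []) for w in sorted(words)]
    while partial:
        next_round = []
        for original, remaining, matched in partial:
            # Try every dictionary word that is a prefix of the remaining part.
            for end in range(1, len(remaining) + 1):
                head = remaining[:end]
                if head not in vocab:
                    continue
                rest = remaining[end:]
                if rest:
                    # Still something left: keep it for the next round.
                    next_round.append((original, rest, matched + [head]))
                else:
                    # Nothing left: this is a final result.
                    finished.setdefault(original, []).append(matched + [head])
        # Entries that matched nothing are simply not carried over.
        partial = next_round
    return finished


print(split_compounds(["spicejet", "xyz"], ["spice", "jet", "ice"]))
```

Words with no complete split, such as "xyz" here, do not appear in the returned mapping at all.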
And how will you decide how to divide things? Look around the web and you'll find examples of URLs that turned out to have other meanings.
Assuming you didn't have the capitals to go on, what would you do with these (these are the ones that come to mind at present; I know there are more):
The last one is particularly problematic because the troublesome part is two words run together rather than a compound word, and the meaning changes completely when you break it.
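To make the ambiguity concrete, here is a small Python sketch that lists every way a string can be cut into dictionary words. The string "nowhere" and the tiny vocabulary are made up purely for illustration, and `all_splits` is a hypothetical helper name.

```python
def all_splits(text, vocab):
    """Return every way to cut `text` into words from `vocab`."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            # Combine this first word with every split of what follows it.
            for tail in all_splits(text[end:], vocab):
                results.append([head] + tail)
    return results


print(all_splits("nowhere", {"no", "now", "here", "where"}))
```

Both splits it finds are valid as far as the dictionary is concerned, so something beyond dictionary lookup has to decide between them.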
If the aim is to find "the largest possible break up for the input", as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter, so you'll have a vertex for each index in the string and one past the end. Next, you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph, connecting the vertex before the first letter of the substring with the vertex after its last letter. Finally, use a shortest path algorithm to find the path with the fewest edges between the first and the last vertex.
The pseudo code is something like this:
Obviously, I haven't tested this pseudo-code, there may be some off-by-one indexing errors, and there isn't any error checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge creation loops are O(M * N) dictionary lookups, where N is the length of the compound word and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind. I think its runtime is O(N^2 * log(N)), since the maximum possible number of edges is on the order of N^2.
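A minimal runnable sketch of the graph idea in Python, not a reconstruction of the pseudo-code referred to above. Because every edge has weight 1, it uses a plain breadth-first search rather than Dijkstra's algorithm; both find a path with the fewest edges. The function name `fewest_words` and the `max_len` parameter are made up for illustration, and the vocabulary is assumed to be a lowercase set.

```python
from collections import deque


def fewest_words(compound, vocab, max_len=None):
    """Split `compound` into as few dictionary words as possible, or return None."""
    n = len(compound)
    max_len = min(max_len or n, n)  # longest substring worth looking up
    # Vertex i sits just before letter i; vertex n is one past the end.
    # edges[i] lists every j such that compound[i:j] is a dictionary word.
    edges = [[] for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            if compound[i:j] in vocab:
                edges[i].append(j)
    # Breadth-first search from vertex 0 reaches each vertex by a fewest-edge path.
    parent = {0: None}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        if i == n:
            break
        for j in edges[i]:
            if j not in parent:
                parent[j] = i
                queue.append(j)
    if n not in parent:
        return None  # the last vertex is unreachable: no complete split exists
    # Walk back from the last vertex to recover the words on the path.
    words = []
    j = n
    while parent[j] is not None:
        i = parent[j]
        words.append(compound[i:j])
        j = i
    return words[::-1]


print(fewest_words("spicejet", {"spice", "jet", "ice", "spic", "e", "j", "et"}))
```

Passing the length of the longest dictionary word as `max_len` keeps the edge-building loops at the O(M * N) lookups mentioned above.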
You can use any shortest path algorithm. There are several, each with its own strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most words possible, then you give the edges negative weights and look for the shortest path with an algorithm that allows negative weights. This is safe here because every edge points forward in the string, so the graph has no cycles.
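For the "most words" variant, here is a hedged Python sketch. Instead of negating the weights and running a general negative-weight shortest path algorithm, it processes the vertices from left to right and keeps the largest word count that reaches each one, which gives the same answer on this kind of forward-only graph. The name `most_words` is hypothetical.

```python
def most_words(compound, vocab):
    """Split `compound` into as many dictionary words as possible, or return None."""
    n = len(compound)
    # best[j] = (largest number of words covering compound[:j], previous vertex),
    # or None if vertex j cannot be reached from vertex 0.
    best = [None] * (n + 1)
    best[0] = (0, None)
    for j in range(1, n + 1):
        for i in range(j):
            if best[i] is not None and compound[i:j] in vocab:
                candidate = (best[i][0] + 1, i)
                if best[j] is None or candidate[0] > best[j][0]:
                    best[j] = candidate
    if best[n] is None:
        return None
    # Walk back from the last vertex to recover the words.
    words = []
    j = n
    while j > 0:
        i = best[j][1]
        words.append(compound[i:j])
        j = i
    return words[::-1]


print(most_words("spicejet", {"spice", "jet", "spic", "e", "j", "et"}))
```

With a vocabulary that contains very short entries, this variant tends to prefer fragments over the words a person would pick.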
It occurs to me that there are a relatively small number of substrings (minimum length 2) from any reasonable compound word. For example for "spicejet" I get:
... 26 substrings.
So, write a function to generate all of those (slide a window across your string with widths of 2, 3, 4, ... up to len(yourstring) - 1) and then simply check each one against a set or hash table.
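A short Python sketch of that approach, assuming the dictionary is held in a lowercase `set`. The function names and the `min_len` parameter are made up for illustration.

```python
def candidate_substrings(text, min_len=2):
    """Every substring of `text` with a length from `min_len` up to len(text) - 1."""
    n = len(text)
    return [
        text[i:i + width]
        for width in range(min_len, n)    # window widths min_len .. n - 1
        for i in range(n - width + 1)     # every start position for this width
    ]


def dictionary_hits(text, vocab, min_len=2):
    """Keep only the candidate substrings that are dictionary words."""
    return [s for s in candidate_substrings(text, min_len) if s in vocab]


print(dictionary_hits("spicejet", {"spice", "jet", "ice", "pie"}))
```

This only tells you which dictionary words occur inside the string; you would still need to check which of them join up to cover the whole string.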