This is what I've in mind, but it's O(n^2):
For ex: Input is "Thisisawesome", we need to check if adding the current character makes the older found set any longer and meaningful. But in order to see till where we need to back up we'll have to traverse all the way to the beginning. For ex: "awe" and "some" make proper words but "awesome" makes the bigger word. Please suggest how can we improve the complexity. Here is the code:
void update(string in)
{
int len= in.length();
int DS[len];
string word;
for(int i=0; i<len; i++) DS[i]=0;
for(int i=0; i<len; i++)
for(int j=i+1; j<=len; j++)
{
word = in.substr(i,j-i);
if(dict.find(word)!=dict.end())
DS[j-1] = (DS[j-1] > word.length()) ? DS[j-1] : word.length();
}
}
There is a dynamic programming solution which at first looks like it is going to be O(n^2) but which turns out to be only O(n) for sufficiently large n and fixed size dictionary.
Work through the string from left to right. At the ith stage you need to work out whether there is a solution for the first i characters. To solve this, consider every possible way to break those i characters into two chunks. If the second chunk is a word and the first chunk can be broken up into words then there is a solution. The first requirement you can check with your dictionary. The second requirement you can check by looking to see if you found an answer for the first j characters, where j is the length of the first chunk.
This would be O(n^2) because for each of 1,2,3,...n lengths you consider every possible split. However, if you know what the longest word in your dictionary is you know that there is no point considering splits which make the second chunk longer than this. So for each of 1,2,3...n lengths you consider at most w possible splits, where w is the longest word in your dictionary, and the cost is O(n).
I have coded my solution today, and will put it on a web site tomorrow. Anyway, the method is as follows:
Arrange the dictionary in a trie.
The trie can help to do multiple matches quickly, because all dictionary words starting with the same letters can be matched at the same time.
(e.g. "chairman" matches "chair" and "chairman" in a trie.)
Use Dijkstra algorithm to find the best match.
(e.g. for "chairman", if we count "c" as position 0, then we have the relationships 0->5, 0->8, 1->5, 2->5, 5->8. These relationship form a network perfect for Dijkstra algorithm.)
(Note: Where's the weights of the edges? See the next point.)
Assign weighting to dictionary words.
Without weighting bad matches do weight over good matches. (e.g. "iamahero" becomes "i ama hero" instead of "i am a hero".)
The SCOWL dictionary at http://app.aspell.net/create serve the purpose well, because it has dictionaries of different sizes. These sizes (10, 20, etc.) is a good choice for weighing).
After some tries I found a need to reduce the weighing of words ending with "s", so "eyesandme" become "eyes and me" instead of "eye sand me".
I have been able to split a paragraph in milliseconds. The algorithm has linear complexity on the length of the string to be splitted, so the algorithm scales well as long as memory is enough.
Here's the dump (sorry for bragging). (The passage selected is "Novel" in Wikipedia.)
D:\GoogleDrive\programs\WordBreaker>"word breaker"<novelnospace.txt>output.txt
D:\GoogleDrive\programs\WordBreaker>type output.txt
Number of words after reading words-10.txt : 4101
Number of words after reading words-20.txt : 11329
Number of words after reading words-35.txt : 43292
Number of words after reading words-40.txt : 49406
Number of words after reading words-50.txt : 87966
Time elapsed in reading dictionary: 0.956782s
Enter the string to be broken into words:
Result:
a novel is along narrative normally in prose which describes fictional character
s and events usually in the form of a sequential story while i an watt in the ri
se of the novel 1957 suggests that the novel came into being in the early 18 th
century the genre has also been described as possessing a continuous and compreh
ensive history of about two thousand years with historical roots in classical gr
eece and rome medieval early modern romance and in the tradition of the novel la
the latter an italian word used to describe short stories supplied the present g
eneric english term in the 18 th century miguel de cervantes author of don quixo
te is frequently cited as the first significant europe an novelist of the modern
era the first part of don quixote was published in 1605 while a more precise de
finition of the genre is difficult the main elements that critics discuss are ho
w the narrative and especially the plot is constructed the themes settings and c
haracterization how language is used and the way that plot character and setting
relate to reality the romance is a related long prose narrative w alter scott d
efined it as a fictitious narrative in prose or verse the interest of which turn
s upon marvellous and uncommon incidents whereas in the novel the events are acc
ommodated to the ordinary train of human events and the modern state of society
however many romances including the historical romances of scott emily brontes w
u the ring heights and her man melvilles mo by dick are also frequently called n
ovels and scott describes romance as a kind red term romance as defined here sho
uld not be confused with the genre fiction love romance or romance novel other e
urope an languages do not distinguish between romance and novel a novel isle rom
and err o ma nil roman z o
Time elapsed in splitting: 0.00495095s
D:\GoogleDrive\programs\WordBreaker>type novelnospace.txt
Anovelisalongnarrativenormallyinprosewhichdescribesfictionalcharactersandeventsu
suallyintheformofasequentialstoryWhileIanWattinTheRiseoftheNovel1957suggeststhat
thenovelcameintobeingintheearly18thcenturythegenrehasalsobeendescribedaspossessi
ngacontinuousandcomprehensivehistoryofabouttwothousandyearswithhistoricalrootsin
ClassicalGreeceandRomemedievalearlymodernromanceandinthetraditionofthenovellaThe
latteranItalianwordusedtodescribeshortstoriessuppliedthepresentgenericEnglishter
minthe18thcenturyMigueldeCervantesauthorofDonQuixoteisfrequentlycitedasthefirsts
ignificantEuropeannovelistofthemodernerathefirstpartofDonQuixotewaspublishedin16
05Whileamoreprecisedefinitionofthegenreisdifficultthemainelementsthatcriticsdisc
ussarehowthenarrativeandespeciallytheplotisconstructedthethemessettingsandcharac
terizationhowlanguageisusedandthewaythatplotcharacterandsettingrelatetorealityTh
eromanceisarelatedlongprosenarrativeWalterScottdefineditasafictitiousnarrativein
proseorversetheinterestofwhichturnsuponmarvellousanduncommonincidentswhereasinth
enoveltheeventsareaccommodatedtotheordinarytrainofhumaneventsandthemodernstateof
societyHowevermanyromancesincludingthehistoricalromancesofScottEmilyBrontesWuthe
ringHeightsandHermanMelvillesMobyDickarealsofrequentlycallednovelsandScottdescri
besromanceasakindredtermRomanceasdefinedhereshouldnotbeconfusedwiththegenreficti
onloveromanceorromancenovelOtherEuropeanlanguagesdonotdistinguishbetweenromancea
ndnovelanovelisleromanderRomanilromanzo
D:\GoogleDrive\programs\WordBreaker>