What is the right way to split a string into words ? (string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here ?
! Update: For those who think this question is just for curiosity. This algorithm could be used to camеlcase domain names ("sportandfishing .com" -> "SportAndFishing .com") and this algo is currently used by aboutus dot org to do this conversion dynamically.
Actually, with the dictionary this problem can be solved in
O(n)
time. More precisely in(k + 1) * n
at worst, wheren
is the number of characters in the string andk
is the length of the longest word in the dictionary.Besides, the algorithm allows you to skip junk.
Here's the working implementation in Common Lisp I've created some time ago: https://gist.github.com/3381522
A simple Java solution which has O(n^2) running time.
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example:
windowsteamblog
(of http://windowsteamblog.com/ fame)windows
team
blog
window
steam
blog
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use properly (that is by incrementally testing for words).
(b) typing "segmentation dynamic programming" yields a score of more detail answers, from university level lectures with pseudo-code algorithm, such as this lecture at Duke's (which even goes so far as to provide a simple probabilistic approach to deal with what to do when you have words that won't be contained in any dictionary).
Let's assume that you have a function
isWord(w)
, which checks ifw
is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some wordw
such a splitting is possible. This can be easily done with dynamic programming.Let
S[1..length(w)]
be a table with Boolean entries.S[i]
is true if the wordw[1..i]
can be split. Then setS[1] = isWord(w[1])
andfor i=2
tolength(w)
calculateS[i] = (isWord[w[1..i] or for any j in {2..i}: S[j-1] and isWord[j..i]).
This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solution by storing all such splits.
Create a list of possible words, sort it from long words to short words.
Check if each entry in the list against the first part of the string. If it equals, remove this and append it at your sentence with a space. Repeat this.