How to split a string into words. Ex: “stringintow

2019-01-07 05:37发布

What is the right way to split a string into words ? (string doesn't contain any spaces or punctuation marks)

For example: "stringintowords" -> "String Into Words"

Could you please advise what algorithm should be used here ?

! Update: For those who think this question is just for curiosity. This algorithm could be used to camеlcase domain names ("sportandfishing .com" -> "SportAndFishing .com") and this algo is currently used by aboutus dot org to do this conversion dynamically.

13条回答
够拽才男人
2楼-- · 2019-01-07 06:07

Actually, with the dictionary this problem can be solved in O(n) time. More precisely in (k + 1) * n at worst, where n is the number of characters in the string and k is the length of the longest word in the dictionary.

Besides, the algorithm allows you to skip junk.

Here's the working implementation in Common Lisp I've created some time ago: https://gist.github.com/3381522

查看更多
▲ chillily
3楼-- · 2019-01-07 06:10

A simple Java solution which has O(n^2) running time.

public class Solution {
    // should contain the list of all words, or you can use any other data structure (e.g. a Trie)
    private HashSet<String> dictionary;

    public String parse(String s) {
        return parse(s, new HashMap<String, String>());
    }

    public String parse(String s, HashMap<String, String> map) {
        if (map.containsKey(s)) {
            return map.get(s);
        }
        if (dictionary.contains(s)) {
            return s;
        }
        for (int left = 1; left < s.length(); left++) {
            String leftSub = s.substring(0, left);
            if (!dictionary.contains(leftSub)) {
                continue;
            }
            String rightSub = s.substring(left);
            String rightParsed = parse(rightSub, map);
            if (rightParsed != null) {
                String parsed = leftSub + " " + rightParsed;
                map.put(s, parsed);
                return parsed;
            }
        }
        map.put(s, null);
        return null;
    }
}
查看更多
一夜七次
4楼-- · 2019-01-07 06:11

If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.

For example: windowsteamblog (of http://windowsteamblog.com/ fame)

  • windows team blog
  • window steam blog
查看更多
你好瞎i
5楼-- · 2019-01-07 06:13

As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:

(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use properly (that is by incrementally testing for words).

(b) typing "segmentation dynamic programming" yields a score of more detail answers, from university level lectures with pseudo-code algorithm, such as this lecture at Duke's (which even goes so far as to provide a simple probabilistic approach to deal with what to do when you have words that won't be contained in any dictionary).

查看更多
倾城 Initia
6楼-- · 2019-01-07 06:14

Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.

Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i=2 to length(w) calculate

S[i] = (isWord[w[1..i] or for any j in {2..i}: S[j-1] and isWord[j..i]).

This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solution by storing all such splits.

查看更多
一夜七次
7楼-- · 2019-01-07 06:19

Create a list of possible words, sort it from long words to short words.

Check if each entry in the list against the first part of the string. If it equals, remove this and append it at your sentence with a space. Repeat this.

查看更多
登录 后发表回答