algorithm to parse string with dictionary

2019-03-11 06:30发布

Given

  • a dictionary full of words {in, july, den, dentist, best, ...} with some C++ API to access it: boolean findWord(string word), or string getNextWord(void) to iterate through it,

  • some input string with no space, e.g.: bestdentistinjuly...

Output

  • best dentist in july is... (basically separate the non-space string by space given a dictionary)

What will be the best algorithm to solve it?

A subtle but important question is, is there any fancy way to solve the unreachable dead-end problem. E.g., den and dentist are both valid words to dissect the rest of the string, one of them may just be a dead-end.

To me it seems like a greedy problem or something solvable by dynamic programming..

6条回答
Animai°情兽
2楼-- · 2019-03-11 06:53

This problem is basically like matching a regex of the form:

(in|july|den|dentis|best|...)*

So any regex algorithm may be used. Which you should choose depends on what will be the size of the dictionary and the length of the input. You should probably start here: http://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times

查看更多
够拽才男人
3楼-- · 2019-03-11 06:55

You can create a kind of word tree:

You can go throught the string with no space. Once you find a word in your list, you add a space and you continu... until you cannot go further.

Then you go back to the previous word and try to se if adding new letter you can create a word, and if you can continu from their.

You try this until you tried all the possiblities.

If you go back to the starting word and you don't find any solution => no sol

Here is the algorithm ( my pseudocode syntax is not good, but you can get the general idea. I believe you will have to correct it a little):

TreeWordResult //Tree that keeps the results in memory

Go through the InputString:

      If you find a word in the InputDictionnary 
          Then add this word to the last node of the treeWordResult
      Otherwise:
          while (No word found):
                go back in the treeWordResult
                try to find word in InputDictionnary different from the one before (on the other node)
          endwhile
          if no word found:
                     return NO SOLUTION
          otherwise:
                     continue going through word
          endif
       endif
 return Leaf   

Algorithm ends when you find no sol, or when your at a "leaf" (you went thhrough the whole string)

Here is an illustration using your example:

enter image description here

Hope my explaination is clear.

查看更多
Explosion°爆炸
4楼-- · 2019-03-11 07:04

Use a Trie to store the dictionary. You can see a simple implementation (C#) at How to create a trie in c#

You're going to need to do a search because you don't know if you are on the right track until you have considered the whole input string. You'll need to iterate through the input string, at the same time as descending into the trie. When you get to a terminal node of the trie, you have a branch in your search process: you need to both treat that as the end of a word and treat it as the first part of a longer word.

查看更多
forever°为你锁心
5楼-- · 2019-03-11 07:06

Maybe there are more than one valid solutions to separate the input string. You could use a backtracking algorithm to just find one or all valid solutions. On positions where two or more words e.g. "den", "dentist" are possible, the algorithm should try the longer words first.

The dictionary of course should be stored into a Trie to quickly find the matching words.

In the following ascii image the left branch would be examined first in a Depth-first search which prefers longer words. A solution would be found before the algorithm looks at the right branch with the word "den".


        Best
        /   \
   dentist  den
      /
     in
    /
  july

查看更多
smile是对你的礼貌
6楼-- · 2019-03-11 07:08

It is unclear from original post on what exactly to solve. I am guessing that @Figo is looking for something similar to a string matching algorithm.

A very good resource for this: http://igm.univ-mlv.fr/~lecroq/string/

查看更多
The star\"
7楼-- · 2019-03-11 07:12

I think you could make it faster if you could have the findWord method return different values for 'not-a-word' and 'no-words-starting-with-this-prefix'. This would be easy if the dictionary was stored as a trie.

The reason is that if you are checking words as in @Ricky Bobby's answer, after you find 'best', you still need to check 'bestd' and 'bestde' and so on all the way to the end of the string. However if the check for 'bestd' returns 'no-longer-words', then you have trimmed out a whole bunch of searching.

查看更多
登录 后发表回答