
Searching strings with . wildcard

Published 2020-08-01 06:11

Question:

I have an array with a large number of strings and want to search it for a pattern. The pattern can contain "." wildcards, each of which matches exactly one arbitrary character.

For example:

myset = {"bar", "foo", "cya", "test"}

find(myset, "f.o") -> returns true (matches with "foo") 
find(myset, "foo.") -> returns false 
find(myset, ".e.t") -> returns true (matches with "test")
find(myset, "cya") -> returns true (matches with "cya")

I tried to find a way to implement this search quickly, because myset is actually a very big array, but none of my ideas has satisfactory complexity (for example, O(size_of(myset) * length(pattern))).

Edit:

myset is a huge array, but the words in it aren't long. I can do slow preprocessing, but I'll have a great many find() queries, so I want find() to be as fast as possible.

Answer 1:

You could build a suffix tree over the corpus of all words in your set (see this link). Using this data structure, your complexity would include a one-time cost of O(n) to build the tree, where n is the sum of the lengths of all your words.

Once the tree is built, checking whether a string matches should take just O(m), where m is the length of the query string.
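
A true suffix tree (e.g., via Ukkonen's construction) is fairly involved, so as a simplified stand-in, here is a rough sketch using a plain trie over the whole words instead, with a depth-first match that branches on '.'. All names are made up for the example; matching only pays for the paths compatible with the pattern.

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True                  # end-of-word marker
    return root

def find(trie, pattern):
    def match(node, i):
        if i == len(pattern):
            return "$" in node            # pattern must consume a whole word
        ch = pattern[i]
        if ch == ".":                     # wildcard: try every child branch
            return any(match(child, i + 1)
                       for key, child in node.items() if key != "$")
        return ch in node and match(node[ch], i + 1)
    return match(trie, 0)

trie = build_trie({"bar", "foo", "cya", "test"})
print(find(trie, "f.o"))    # True
print(find(trie, "foo."))   # False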



Answer 2:

If the set is fixed, you could pre-calculate the frequency of each character c appearing at each position p (for as many values of p as you consider worthwhile), then search through the array once, testing each element's characters at specific positions in an order such that you are most likely to exit early.
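
A minimal sketch of that idea (all names are illustrative): count character-at-position frequencies during preprocessing, then test the rarest pattern positions first while scanning.

from collections import Counter

def preprocess(words):
    # freq[p][c] = number of words having character c at position p
    max_len = max(map(len, words))
    freq = [Counter() for _ in range(max_len)]
    for word in words:
        for p, ch in enumerate(word):
            freq[p][ch] += 1
    return freq

def find(words, pattern, freq):
    if len(pattern) > len(freq):
        return False                      # no word is that long
    # Check the fixed (non-'.') positions rarest-first for early exits.
    fixed = sorted((p for p, ch in enumerate(pattern) if ch != "."),
                   key=lambda p: freq[p][pattern[p]])
    return any(len(word) == len(pattern) and
               all(word[p] == pattern[p] for p in fixed)
               for word in words)

words = {"bar", "foo", "cya", "test"}
freq = preprocess(words)
print(find(words, ".e.t", freq))   # True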



Answer 3:

First, divide the corpus into sets per word length. Then your find algorithm can search over the appropriate set, since the input to find() always requires the match to have a specific length, and the algorithm can be designed to work well with all words of the same length.

Next (for each set), create a hash map from a hash of (character, position) to a list of matching words. It is quite OK to have a large number of hash collisions. You can use delta and run-length encoding to reduce the size of the lists of matching words.

To search, pick the appropriate hash map for the length of the find() input and, for each non-'.' character, compute the hash for that (character, position) pair and AND together the lists of words to get a much reduced list.

Brute-force search through that much smaller list.
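
A sketch of this scheme, with illustrative names. Plain Python sets stand in for the hashed, run-length-encoded lists described above; since sets are exact (no collisions), the intersection already gives the final answer and the brute-force verification pass becomes unnecessary.

from collections import defaultdict

def build_index(words):
    # index[length][(position, character)] = set of words of that length
    index = defaultdict(lambda: defaultdict(set))
    by_length = defaultdict(set)
    for word in words:
        by_length[len(word)].add(word)
        for p, ch in enumerate(word):
            index[len(word)][(p, ch)].add(word)
    return index, by_length

def find(index, by_length, pattern):
    n = len(pattern)
    candidates = by_length[n]     # an all-'.' pattern matches any word of length n
    for p, ch in enumerate(pattern):
        if ch != ".":
            candidates = candidates & index[n][(p, ch)]
            if not candidates:
                return False
    return bool(candidates)

index, by_length = build_index({"bar", "foo", "cya", "test"})
print(find(index, by_length, "f.o"))    # True
print(find(index, by_length, "foo."))   # False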



Answer 4:

If you are sure that the words in your set are not long, you could create a table which holds the following:

List of words whose first character is 'a', list of words whose first character is 'b', ...

List of words whose second character is 'a', list of words whose second character is 'b', ...

and so on.

When you are searching for a word, look up the list of words whose first character matches the search string's first character. Within this refined list, look for the words whose second character matches the search string's second character, and so on. You can skip '.' characters whenever you encounter them.

I understand that building the table may take a large amount of space, but the query time will come down significantly.

For example, if you have myset = {"bar", "foo", "cya", "test"} and you are searching for 'f.o':

The moment you check the list of words starting with 'f', you eliminate the rest of the set. Just an idea; hope it helps.
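
A sketch of this table-based refinement (names are illustrative): one table lookup on the first fixed position fetches an initial list, which is then narrowed position by position.

from collections import defaultdict

def build_table(words):
    # table[(position, character)] = list of words with that character there
    table = defaultdict(list)
    for word in words:
        for p, ch in enumerate(word):
            table[(p, ch)].append(word)
    return table

def find(table, words, pattern):
    fixed = [p for p, ch in enumerate(pattern) if ch != "."]
    if not fixed:                         # all-wildcard: match on length alone
        return any(len(w) == len(pattern) for w in words)
    # One table lookup for the first fixed position...
    candidates = table[(fixed[0], pattern[fixed[0]])]
    # ...then refine the list one fixed position at a time, skipping '.'.
    for p in fixed[1:]:
        candidates = [w for w in candidates if len(w) > p and w[p] == pattern[p]]
    return any(len(w) == len(pattern) for w in candidates)

words = {"bar", "foo", "cya", "test"}
table = build_table(words)
print(find(table, words, "f.o"))    # True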



Answer 5:

I had this same question, and I wasn't completely happy with most of the ideas/solutions I found on the internet. I think the "right" way to do this is to use a Directed Acyclic Word Graph. I didn't quite do that, but I added some additional logic to a Trie to get a similar effect.

See my isWord() implementation, analogous to your desired find() interface. It works by recursing down the Trie, branching on wildcard, and then collecting results back into a common set. (See findNodes().)

getMatchingWords() is similar in spirit, except that it returns the set of matching words, instead of just a boolean as to whether or not the query matches anything.
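
The original implementation isn't shown here, so the following is a hypothetical reconstruction of what such a Trie might look like: findNodes() advances a frontier of live nodes one pattern character at a time, branching on '.', and isWord() / getMatchingWords() are thin wrappers over it.

# Hypothetical reconstruction; the answer's actual code is not shown.
class Trie:
    def __init__(self, words=()):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = word              # terminal marker stores the word

    def findNodes(self, pattern):
        # Advance a set of live nodes one character at a time,
        # branching into every child when the character is '.'.
        nodes = [self.root]
        for ch in pattern:
            if ch == ".":
                nodes = [child for node in nodes
                         for key, child in node.items() if key != "$"]
            else:
                nodes = [node[ch] for node in nodes if ch in node]
            if not nodes:
                break
        return nodes

    def isWord(self, pattern):            # analogous to the desired find()
        return any("$" in node for node in self.findNodes(pattern))

    def getMatchingWords(self, pattern):
        return {node["$"] for node in self.findNodes(pattern) if "$" in node}

t = Trie({"bar", "foo", "cya", "test"})
print(t.isWord(".e.t"))            # True
print(t.getMatchingWords("..a"))   # {'cya'}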