How to do a efficiently check if a string partiall

2019-07-29 09:39发布

Say I have a Set of Strings:

Set<String> things = new HashSet<String>();
things.add("coffee cup");
things.add("smartphone");
things.add("inkjet printer");
//   :
// list could be quite large (100K or so, perhaps loaded from a database)
//   :

Now I want to check if another string completely contains any of the Strings in the above set. So:

"a coffee cup" - matches
"android smartphone" - matches
"inkjet printer for sale" - matches
"laser printer" - does not match
"printer" - does not match

The only way I can think of is iterating through the set (and break-ing if found). Is there a more efficient and elegant way to do this?

3条回答
Summer. ? 凉城
2楼-- · 2019-07-29 10:08

I built off of @meriton's suggestion. Instead of every possible substring combination, I'm going to do every possible word combination.

Set<String> permutations = new HashSet<String>();

String [] arr = token.split(" ");  
int size = arr.length;

for (int i = size ; i > 0; i--) {
    for (int j = 0 ; j < i; j++) {

        StringBuilder permutation = new StringBuilder();
        permutation.append(arr[j]);
        for (int k = j+1  ; k < i; k++) {
            permutation.append(" ");
            permutation.append(arr[k]);
        }
        permutations.add(permutation.toString());

    }
}

If I run "inkjet printer for sale" through the code snippet above, I get:

  • inkjet printer for sale
  • printer for sale
  • for sale
  • sale
  • inkjet printer for
  • printer for
  • for
  • inkjet printer
  • printer
  • inkjet

Then I can do a simple contains() on the original set of words.

查看更多
Bombasti
3楼-- · 2019-07-29 10:23

Iterate over all substrings of the candidate, and check whether the set contains them?

boolean containsSubstring(Set<String> set, String str) {
    for (int i = 0; i < str.length; i++) {
        for (int j = i + 1; j < str.length; j++) {
            if (set.contains(str.substring(i,j))) {
                return true;
            }
        }
    }
    return false;
}

Yes, a string of length k has k^2 substrings, but that may still be far less than the number of strings in the set ...

查看更多
该账号已被封号
4楼-- · 2019-07-29 10:25

You need Aho-Corasick algorithm. http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

https://github.com/raymanrt/aho-corasick

Time complexity is O(m) for preprocessing (where m is total length of strings in the set) and O(n) for matching (where n is length of matched string). So it's asymptotically optimal.

查看更多
登录 后发表回答