what is the fastest substring search method in Jav

2019-01-14 23:09发布

I need to implement a way to search substring (needles) in a list of string (haystack) using Java.

More specifically, my app has a list of user profiles. If I type some letters, for example, "Ja", and then search, then all the users whose name contains "ja" should show up. For instance, the result could be "Jack", "Jackson", "Jason", "Dijafu".

In Java, as I know, there are 3 build-in method to see search substring in a string.

  1. string.contains()

  2. string.indexOf()

  3. regular expression. it is something like string.matches("ja"))

My question is: What are the runtimes of each method above? which one is the fastest or most efficient or most popular way to check if the list of string contains a given substring.

I know there exists some algorithms that do the same thing, such as Boyer–Moore string search algorithm, Knuth–Morris–Pratt algorithm and so on. I do not want to use them because I just have a small list of strings, and I think using them is kind of overkill for me right now. Also I have to type a lot of extra coding for such a non-build-in algorithm. If you think my thoughts is not correct, please feel free to correct me.

6条回答
欢心
2楼-- · 2019-01-14 23:29

If you are searching a large amount of Strings I've read the Aho-Corasick algorithm is pretty fast, but it's a natively implemented in Java. It's the same algorithm used by GREP in Unix-based systems if that helps and it's pretty efficient. Here is a Java implementation courtesy of Berkley.

See also: https://stackoverflow.com/a/1765616/59087

查看更多
欢心
3楼-- · 2019-01-14 23:30

This is dependent on specific JRE (and even JDK) make/version. It also depends / could depend on factors as string length, probability of being contained, in what position, etc. The only way of obtaining precise performance data requires setting up your exact context.

However, in general aString.contains() and aString.indexOf() should be exactly the same. And even if a regular expression was superbly optimized, it wouldn't exceed the performance of the first two.

No, Java doesn't use extremely specialized algorithms either.

查看更多
太酷不给撩
4楼-- · 2019-01-14 23:33

As far as the three you asked about, a regular expression is going to be much slower because it requires putting together a full state machine when you have a much simpler target. For contains vs indexOf...

2114 public boolean contains(CharSequence s) {
2115     return indexOf(s.toString()) > -1;
2116 }

(i.e., contains just calls indexOf, but you might incur an extra String creation on each invocation. This is just one implementation of contains, but since the contract of contains is a simplification of indexOf, this is probably how every implementation will work.)

查看更多
smile是对你的礼貌
5楼-- · 2019-01-14 23:33

From the example in your question, I assume you want to do case insensitive comparisons. Those slow down the process considerably. Hence, if you can live with some inaccuracies - which might depend on the locale in which you need to do the comparison, and your long text is searched again and again, it might make sense to convert the long text one time to uppercase, and the search string as well, and then search case-insensitive.

查看更多
手持菜刀,她持情操
6楼-- · 2019-01-14 23:34

The accepted answer is not correct and not complete.

  • indexOf() does a naive string search using backtracking on mismatches. This is quite fast on small patterns/texts but shows very poor performance on large texts
  • contains("ja") should be comparable to indexOf (because it delegates to it)
  • matches("ja") will not deliver the correct result, because it searches for an exact match (only the string "ja" will match exactly)
  • Pattern p = Pattern.compile("ja"); Matcher m = p.matcher("jack"); m.find(); would be the correct way to find texts with regular expressions. In practice (using large texts) it will be the most efficient way using only the java api. This is because a constant pattern (like "ja") will not be processed by the regex engine (which is slow) but by an Boyer-Moore-Algorithm (which is fast)
查看更多
看我几分像从前
7楼-- · 2019-01-14 23:38
String[] names = new String[]{"jack", "jackson", "jason", "dijafu"};
long start = 0;
long stop = 0;

//Contains
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
    names[i].contains("ja");
}
stop = System.nanoTime();
System.out.println("Contains: " + (stop-start));

//IndexOf
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
    names[i].indexOf("ja");
}
stop = System.nanoTime();
System.out.println("IndexOf: " + (stop-start));

//Matches
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
    names[i].matches("ja");
}
stop = System.nanoTime();
System.out.println("Matches: " + (stop-start));

Output:

Contains: 16677
IndexOf: 4491
Matches: 864018
查看更多
登录 后发表回答