what is the fastest substring search method in Jav

I need to implement a way to search substring (needles) in a list of string (haystack) using Java.

More specifically, my app has a list of user profiles. If I type some letters, for example, "Ja", and then search, then all the users whose name contains "ja" should show up. For instance, the result could be "Jack", "Jackson", "Jason", "Dijafu".

In Java, as I know, there are 3 build-in method to see search substring in a string.

string.contains()
string.indexOf()
regular expression. it is something like string.matches("ja"))

My question is: What are the runtimes of each method above? which one is the fastest or most efficient or most popular way to check if the list of string contains a given substring.

I know there exists some algorithms that do the same thing, such as Boyer–Moore string search algorithm, Knuth–Morris–Pratt algorithm and so on. I do not want to use them because I just have a small list of strings, and I think using them is kind of overkill for me right now. Also I have to type a lot of extra coding for such a non-build-in algorithm. If you think my thoughts is not correct, please feel free to correct me.

标签： java regex search substring search-engine

6条回答

欢心

2楼-- · 2019-01-14 23:29

If you are searching a large amount of Strings I've read the Aho-Corasick algorithm is pretty fast, but it's a natively implemented in Java. It's the same algorithm used by GREP in Unix-based systems if that helps and it's pretty efficient. Here is a Java implementation courtesy of Berkley.

0人赞添加讨论(0) 举报

欢心

3楼-- · 2019-01-14 23:30

This is dependent on specific JRE (and even JDK) make/version. It also depends / could depend on factors as string length, probability of being contained, in what position, etc. The only way of obtaining precise performance data requires setting up your exact context.

However, in general aString.contains() and aString.indexOf() should be exactly the same. And even if a regular expression was superbly optimized, it wouldn't exceed the performance of the first two.

No, Java doesn't use extremely specialized algorithms either.

0人赞添加讨论(0) 举报

太酷不给撩

4楼-- · 2019-01-14 23:33

As far as the three you asked about, a regular expression is going to be much slower because it requires putting together a full state machine when you have a much simpler target. For contains vs indexOf...

2114 public boolean contains(CharSequence s) {
2115     return indexOf(s.toString()) > -1;
2116 }

(i.e., contains just calls indexOf, but you might incur an extra String creation on each invocation. This is just one implementation of contains, but since the contract of contains is a simplification of indexOf, this is probably how every implementation will work.)

0人赞添加讨论(0) 举报

smile是对你的礼貌

5楼-- · 2019-01-14 23:33

From the example in your question, I assume you want to do case insensitive comparisons. Those slow down the process considerably. Hence, if you can live with some inaccuracies - which might depend on the locale in which you need to do the comparison, and your long text is searched again and again, it might make sense to convert the long text one time to uppercase, and the search string as well, and then search case-insensitive.

0人赞添加讨论(0) 举报

手持菜刀，她持情操

6楼-- · 2019-01-14 23:34

The accepted answer is not correct and not complete.

indexOf() does a naive string search using backtracking on mismatches. This is quite fast on small patterns/texts but shows very poor performance on large texts
contains("ja") should be comparable to indexOf (because it delegates to it)
matches("ja") will not deliver the correct result, because it searches for an exact match (only the string "ja" will match exactly)
Pattern p = Pattern.compile("ja"); Matcher m = p.matcher("jack"); m.find(); would be the correct way to find texts with regular expressions. In practice (using large texts) it will be the most efficient way using only the java api. This is because a constant pattern (like "ja") will not be processed by the regex engine (which is slow) but by an Boyer-Moore-Algorithm (which is fast)

0人赞添加讨论(0) 举报

看我几分像从前

7楼-- · 2019-01-14 23:38

String[] names = new String[]{"jack", "jackson", "jason", "dijafu"};
long start = 0;
long stop = 0;

//Contains
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
    names[i].contains("ja");
}
stop = System.nanoTime();
System.out.println("Contains: " + (stop-start));

//IndexOf
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
    names[i].indexOf("ja");
}
stop = System.nanoTime();
System.out.println("IndexOf: " + (stop-start));

//Matches
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
    names[i].matches("ja");
}
stop = System.nanoTime();
System.out.println("Matches: " + (stop-start));

Output:

Contains: 16677
IndexOf: 4491
Matches: 864018

0人赞添加讨论(0) 举报

what is the fastest substring search method in Jav

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间