What is the fastest way to search for substring in

2019-06-02 21:40发布

问题:

I want to understand the performance issues that can emerge while making substring search in Java. I know the two built-in methods of searching for substring in Java.

1. String.indexOf()

As far as I understand this method uses the brute-force algorithm of substring search, thus its complexity is O(nm) where n and m are lengths of string and pattern.

2. Use Pattern and Matcher

I know nothing about how the regex algorithms are implemented and about their complexity.

So the questions are:

1) Which of these methods is preferrable from the perspective of performance?

2) What is the complexity of regex search? Does it depends on the regex itself?

回答1:

Honestly, if you care about worst-case performance, JNI into native code that calls your standard library's strstr function. Well-implemented strstr, like the one in recent versions of glibc, has linear worst-case running time and constant worst-case space usage. I believe glibc's strstr can do Boyer-Moore-like long jumps through the text, too. C standard libraries are maintained by people who know how to write and maintain good and general-purpose libraries and practise their craft. The same cannot be said for the Java standard class library.

You will have to turn a Java UTF-16 string into something suitable for strstr, such as a UTF-8 string. You will also have to handle embedded zero bytes in the UTF-8 string gracefully. Other than that, you will reap the benefits of a well-written and well-maintained library.

Java does regex searches (for this particular case) using a Boyer-Moore string search hacked into a naive regex implementation. Compiling a Pattern with just your string will result in a Matcher that performs relatively well. Note, however, that this does NOT extend to anything beyond string searching with the regex library; you're still stuck with a naive regex implementation that backtracks and all if you feed it a nontrivial regular expression.

As evidence for why you shouldn't use Java regex for actual regexes, I present you the following:

public class regex {
  public static void main(String[] args) throws Exception {
    String haystack = "ab";
    String needle = "abab?.*";
    for (int i = 0; i < 7; i++) haystack = haystack + haystack;
    for (int i = 0; i < 4; i++) needle = needle + needle;
    System.out.println(haystack.length() + " " + needle.length());
    long before = System.currentTimeMillis();
    System.out.println(Pattern.matches(needle, haystack));
    long after = System.currentTimeMillis(); // long after indeed...
    System.out.println(after - before);
  }
}

This is a search in a 256-character haystack for a needle regex (that's an honest regex that you learnt about in compilers class) of 112 characters. It takes about 24 seconds to complete on my machine.