I want to understand the performance issues that can emerge while making substring search in Java. I know the two built-in methods of searching for substring in Java.
1. String.indexOf()
As far as I understand this method uses the brute-force algorithm of substring search, thus its complexity is O(nm) where n and m are lengths of string and pattern.
2. Use Pattern and Matcher
I know nothing about how the regex algorithms are implemented and about their complexity.
So the questions are:
1) Which of these methods is preferrable from the perspective of performance?
2) What is the complexity of regex search? Does it depends on the regex itself?
Honestly, if you care about worst-case performance, JNI into native code that calls your standard library's
strstr
function. Well-implementedstrstr
, like the one in recent versions of glibc, has linear worst-case running time and constant worst-case space usage. I believe glibc'sstrstr
can do Boyer-Moore-like long jumps through the text, too. C standard libraries are maintained by people who know how to write and maintain good and general-purpose libraries and practise their craft. The same cannot be said for the Java standard class library.You will have to turn a Java UTF-16 string into something suitable for
strstr
, such as a UTF-8 string. You will also have to handle embedded zero bytes in the UTF-8 string gracefully. Other than that, you will reap the benefits of a well-written and well-maintained library.Java does regex searches (for this particular case) using a Boyer-Moore string search hacked into a naive regex implementation. Compiling a
Pattern
with just your string will result in aMatcher
that performs relatively well. Note, however, that this does NOT extend to anything beyond string searching with the regex library; you're still stuck with a naive regex implementation that backtracks and all if you feed it a nontrivial regular expression.As evidence for why you shouldn't use Java regex for actual regexes, I present you the following:
This is a search in a 256-character haystack for a needle regex (that's an honest regex that you learnt about in compilers class) of 112 characters. It takes about 24 seconds to complete on my machine.