Stack overflow when trying to use regex in java

2019-08-21 02:22发布

问题:

I have read up on some articles on how to optimize regex but none of the answers (less groups, using {X,Y} instead of *) seemed to stop my regex from getting a stack overflow error.

I am trying to make a dynamic search through a file. Lets say i am searching for 'i bet you cannot find me' in a file that is pretty large (2-4 mb). My regex generator would generate the regex:

i(?:.|\s)*?bet(?:.|\s)*?you(?:.|\s)*?cannot(?:.|\s)*?find(?:.|\s)*?me

the idea to this regex is that it finds the exact phrase no matter what characters or white space comes between the words. However when i try to use:

Pattern p = Pattern.compile(generatedRegex, Pattern.MULTILINE);
Matcher m = p.matcher(fileContentsAsString);
while (m.find()) {
System.out.println(m.group())
}

I am getting a stack overflow error. I know that regex use recursion but it doesnt seem like this is that bad of a regex. Is there any way I can optimize this regex? Thanks!

ANSWER:

Pattern p = Pattern.compile("i(?:.*)bet(?:.*)you(?:.*)cannot(?:.*)find(?:.*?)me", Pattern.DOTALL);

is the pattern/regex that I ultimately am using. Seems fast and no longer getting a stack overflow exception

回答1:

I think you are getting a lot of backtracking because of your reluctant qualifiers (*?). One way to prevent backtracking is to use atomic grouping (?>X), and/or possessive qualifier (*+).

According to the comments, you also prefer to capture only the "i" that is nearest to "bet" to reduce the length of the overall match. Since you want to get the closest 'i' to the rest of the words, then in the place where I added negative lookahead for word two, you would put also a negative lookahead for word one, right beside it. In other words, (?!bet) would become (?!i)(?!bet) or (?!i|bet). I have edited the code below to include this requirement.

String fileContentsAsString = "ii ... bet ... you, ibetyouyou";
String regex = "i(?>(?!i|bet).)*+bet(?>(?!you).)*+you";
Pattern p = Pattern.compile(regex, Pattern.DOTALL);
Matcher m = p.matcher(fileContentsAsString);
while (m.find()) {
    System.out.println(m.group());
}

Output:

i .... bet .... you

ibetyou

Explanation (source):

"The way a reluctant quantifier works is, each time it's supposed to try to match, it first tries to let the next part of the regex match instead. So it's effectively doing a lookahead at the beginning of each iteration, which can get pretty expensive, especially when the quantified part only matches one character per iteration, like .*?"