Regex for multiline string literals produces `Stac

2019-07-15 05:42发布

I want to match strings enclosed in triple "-quotes which may contain line breaks, and which don't contain any """-substrings except at the very beginning and in the very end.

Valid example:

"""foo
bar "baz" blah"""

Invalid example:

"""foo bar """ baz"""

I tried using the following regex (as Java String literal):

"(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\""

and it seems to work on short examples. However, on longer examples, like on a string consisting of thousand lines with hello world, it gives me a StackOverflowError.

Scala snippet to reproduce the error

import java.util.regex.{Pattern, Matcher}

val text = "\"" * 3 + "hello world \n" * 1000 + "\"" * 3
val p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"")
println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt())
println(p.matcher(text).lookingAt())

(note: test locally, Scastie times out; or maybe reduce 1000 to smaller number?).

Java snippet that produces the same error

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class RegexOverflowMain {
  public static void main(String[] args) {
    StringBuilder bldr = new StringBuilder();
    bldr.append("\"\"\"");
    for (int i = 0; i < 1000; i++) {
      bldr.append("hello world \n");
    }
    bldr.append("\"\"\"");
    String text = bldr.toString();
    Pattern p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"");
    System.out.println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt());
    System.out.println(p.matcher(text).lookingAt());
  }
}

Question

Any idea how to make this "stack safe", i.e. can someone find a regex that accepts the same language, but does not produce a StackOverflowError when fed to the Java regex API?

I don't care whether the solution is in Scala or Java (or whatever), as long the same underlying Java regex library is used.

2条回答
冷血范
2楼-- · 2019-07-15 05:45

Solution using a negative look-ahead to basically find a string that starts with """ and end with """ and contains content that does not include """

As Plain regex: ^"""((?!""")[\s\S])*"""$

As Java escaped regex: "^\"\"\"((?!\"\"\")[\\s\\S])*\"\"\"$"

\s\S includes line-break (its basically . + line-break or . with single line flag)

This should be used without the multiline flag so that ^ and $ match the start and end of the string and not the start and end of the line

otherwise this:

""" ab """abc""" abc """

would match

Also i used this as reference for how to exclude the """: Regular expression to match a line that doesn't contain a word?

查看更多
We Are One
3楼-- · 2019-07-15 06:02

The full answer below optimizes the regex performance, but to prevent stack overflow, as a simple solution, just make the repeating group possessive.

Non-possessive repeating groups with choices need recursive calls to allow backtracking. Making it possessive fixes the problem, so simply add a + after the *:

"(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*+\"\"\""


Also note that if you want to match entire input, you need to call matches(), not lookingAt().


Performance boost

Note: A quick performance test showed this to be more than 6 times faster than regex in answer by x4rf41.

Instead of matching one of

  • Not a quote: [^\"]
  • Exactly one quote: (?:\"[^\"])
  • Exactly two quotes: (?:\"\"[^\"])

in a loop, first match everything up to a quote. If that is a single- or double-quote, but not a triple-quote, match the 1-2 quotes then everything up to next quote, repeat as needed. Finally match the ending triple-quote.

That matching is definitive, so make the repeats possessive. This also prevent stack overflow in case input has many embedded quotes.

"{3}          match 3 leading quotes
[^"]*+        match as many non-quotes as possible (if any) {possesive}
(?:           start optional repeating group
   "{1,2}       match 1-2 quotes
   [^"]++       match one or more non-quotes (at least one) {possesive}
)*+           end optional repeating group                  {possesive}
"{3}          match 3 trailing quotes

Since you don't use ^ or $, there is no need for (?m) (MULTILINE)

As Java string:

"\"{3}[^\"]*+(?:\"{1,2}[^\"]++)*+\"{3}"

查看更多
登录 后发表回答