I want to match strings enclosed in triple "
-quotes which may contain line breaks, and which don't contain any """
-substrings except at the very beginning and in the very end.
Valid example:
"""foo
bar "baz" blah"""
Invalid example:
"""foo bar """ baz"""
I tried using the following regex (as Java String
literal):
"(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\""
and it seems to work on short examples. However, on longer examples, like on a string consisting of thousand lines with hello world
, it gives me a StackOverflowError
.
Scala snippet to reproduce the error
import java.util.regex.{Pattern, Matcher}
val text = "\"" * 3 + "hello world \n" * 1000 + "\"" * 3
val p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"")
println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt())
println(p.matcher(text).lookingAt())
(note: test locally, Scastie times out; or maybe reduce 1000 to smaller number?).
Java snippet that produces the same error
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class RegexOverflowMain {
public static void main(String[] args) {
StringBuilder bldr = new StringBuilder();
bldr.append("\"\"\"");
for (int i = 0; i < 1000; i++) {
bldr.append("hello world \n");
}
bldr.append("\"\"\"");
String text = bldr.toString();
Pattern p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"");
System.out.println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt());
System.out.println(p.matcher(text).lookingAt());
}
}
Question
Any idea how to make this "stack safe", i.e. can someone find a regex that accepts the same language, but does not produce a StackOverflowError
when fed to the Java regex API?
I don't care whether the solution is in Scala or Java (or whatever), as long the same underlying Java regex library is used.
Solution using a negative look-ahead to basically find a string that starts with
"""
and end with"""
and contains content that does not include"""
As Plain regex:
^"""((?!""")[\s\S])*"""$
As Java escaped regex:
"^\"\"\"((?!\"\"\")[\\s\\S])*\"\"\"$
"\s\S
includes line-break (its basically.
+ line-break or.
with single line flag)This should be used without the multiline flag so that
^
and$
match the start and end of the string and not the start and end of the lineotherwise this:
""" ab """abc""" abc """
would match
Also i used this as reference for how to exclude the
"""
: Regular expression to match a line that doesn't contain a word?The full answer below optimizes the regex performance, but to prevent stack overflow, as a simple solution, just make the repeating group possessive.
Non-possessive repeating groups with choices need recursive calls to allow backtracking. Making it possessive fixes the problem, so simply add a
+
after the*
:"(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*+\"\"\""
Also note that if you want to match entire input, you need to call
matches()
, notlookingAt()
.Performance boost
Note: A quick performance test showed this to be more than 6 times faster than regex in answer by x4rf41.
Instead of matching one of
[^\"]
(?:\"[^\"])
(?:\"\"[^\"])
in a loop, first match everything up to a quote. If that is a single- or double-quote, but not a triple-quote, match the 1-2 quotes then everything up to next quote, repeat as needed. Finally match the ending triple-quote.
That matching is definitive, so make the repeats possessive. This also prevent stack overflow in case input has many embedded quotes.
Since you don't use
^
or$
, there is no need for(?m)
(MULTILINE)As Java string:
"\"{3}[^\"]*+(?:\"{1,2}[^\"]++)*+\"{3}"