I'm borrowing a rather complex regex from some PHP Textile implementations (open source, properly attributed) for textile4j, a simple, not quite feature-complete Java implementation that I'm porting to GitHub and syncing to Maven Central. (The original code was written as a plugin for blojsom, a Java blogging platform; this is part of a larger effort to make blojsom dependencies available in Maven Central.)
Unfortunately, the Textile regular expressions (while they work in the context of preg_replace_callback in PHP) fail in Java with the following exception:
java.util.regex.PatternSyntaxException: Unclosed character class near index 217
The message is clear enough; the solution is elusive.
Here's the raw, multiline regex from the PHP implementation:
return preg_replace_callback('/
(^|(?<=[\s>.\(])|[{[]) # $pre
" # start
(' . $this->c . ') # $atts
([^"]+?) # $text
(?:\(([^)]+?)\)(?="))? # $title
":
('.$this->urlch.'+?) # $url
(\/)? # $slash
([^\w\/;]*?) # $post
([\]}]|(?=\s|$|\)))
/x',callback,input);
I got the Textile class to "show me the code" being used in this regex with a simple echo, which resulted in the following, rather long, regular expression:
(^|(?<=[\s>.\(])|[{[])"((?:(?:\([^)]+\))|(?:\{[^}]+\})|(?:\[[^]]+\])|(?:\<(?!>)|(?<!<)\>|\<\>|\=|[()]+(?! )))*)([^"]+?)(?:\(([^)]+?)\)(?="))?":([\w"$\-_.+!*'(),";\/?:@=&%#{}|\^~\[\]`]+?)(\/)?([^\w\/;]*?)([\]}]|(?=\s|$|\)))
I've uncovered a couple of possible areas that could be resulting in parsing errors, using online tools such as RegExr by gskinner and RegexPlanet. However, none of those particulars fix the error.
I suspect that there is a range issue hidden in one of the character classes, or a Unicode order hiding somewhere, but I can't find it.
Any ideas?
I'm also curious why PHP doesn't throw a similar error. For example, using RegExr I found one "passive subexpression" (non-capturing group) that was poorly handled, but changing it didn't fix the Java exception and didn't alter behavior in PHP. The change is shown below.
In the $title group, swap the escaped paren with the capturing paren:
(?:\(([^)]+?)\)(?="))? # $title
...^
(?:(\([^)]+?)\)(?="))? # $title
....^
Thanks, Tim
edit: adding a Java String interpretation (with escapes) of the Textile regex, as determined by RegexPlanet ...
"(^|(?<=[\\s>.\\(])|[{[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)(\\/)?([^\\w\\/;]*?)([\\]}]|(?=\\s|$|\\)))"
I'm not sure exactly where your problem lies, but this might help:
In Java (and I believe this is unique to Java), the [ symbol (not just the ] symbol) is reserved inside character classes and needs to be escaped. For the expression to be Java-compatible, every unescaped [ inside a character class has to be escaped.
Basically, any place where most regex flavors will allow a character class like [a-z,;[\]+-], which would match "either a letter a-z or a comma, semicolon, open or close square bracket, plus or minus sign", needs to actually be [a-z,;\[\]+-] (escape the [ with a \ character).
This escaping requirement is due to the Java union, intersection and subtraction character-class constructs.
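To make the point above concrete, here is a minimal sketch showing that Java treats an unescaped [ inside a character class as the start of a nested class, which is what enables its union and intersection syntax. The class name NestedClassDemo and the helper compiles are made up for illustration; only java.util.regex.Pattern and PatternSyntaxException come from the JDK.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of textile4j.
public class NestedClassDemo {

    // Returns true if Java accepts the regex, false on a syntax error.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class from the $pre group: the inner [ opens a nested class
        // that is never closed, so compilation fails.
        System.out.println(compiles("[{[]"));    // false

        // Escaping the [ makes it a literal member of the class.
        System.out.println(compiles("[{\\[]"));  // true

        // The constructs that make [ special in Java:
        // union (a-d or m-p) ...
        System.out.println(Pattern.matches("[a-d[m-p]]", "n"));       // true
        // ... and subtraction via intersection (a-z except vowels).
        System.out.println(Pattern.matches("[a-z&&[^aeiou]]", "e"));  // false
    }
}
```

PHP's PCRE has no nested-class syntax of this kind, so a lone [ inside a class is just a literal there. That is why the same pattern is accepted by preg_replace_callback.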
@CodeJockey is correct: there's a square bracket in one of your character classes that needs to be escaped. []] or [^]] are okay because the ] is the first character other than the negating ^, but in Java an unescaped [ anywhere in a character class is a syntax error.
Ironically, the original regex contains many backslashes that aren't needed even in PHP. It also escapes / because that's what it uses as the regex delimiter. After weeding all those out I came up with a Java regex, though whether it's the best regex I have no idea, not knowing how it's being used.
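As a rough check of the diagnosis, the sketch below takes the Java string from the question's edit and changes only the [{[] class in the $pre group to [{\[]. This is the minimal fix, not the cleaned-up regex described above: all the redundant backslashes are left in place. The class name TextileLinkRegexDemo, the helper linkRegex, and the sample input are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of textile4j.
public class TextileLinkRegexDemo {

    // Builds the question's link regex around the given "$pre" bracket class,
    // so the original and the fixed variant differ in exactly one place.
    static String linkRegex(String preBracketClass) {
        return "(^|(?<=[\\s>.\\(])|" + preBracketClass + ")\""            // $pre, start
            + "((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])"
            + "|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)"           // $atts
            + "([^\"]+?)"                                                  // $text
            + "(?:\\(([^)]+?)\\)(?=\"))?"                                  // $title
            + "\":"
            + "([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)"        // $url
            + "(\\/)?"                                                     // $slash
            + "([^\\w\\/;]*?)"                                             // $post
            + "([\\]}]|(?=\\s|$|\\)))";
    }

    static final String ORIGINAL = linkRegex("[{[]");    // as posted
    static final String FIXED = linkRegex("[{\\[]");     // [ escaped

    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles(ORIGINAL)); // false
        System.out.println(compiles(FIXED));    // true

        // Sample Textile link, chosen for illustration only.
        Matcher m = Pattern.compile(FIXED).matcher("\"link text\":http://example.com");
        if (m.find()) {
            System.out.println(m.group(3)); // $text
            System.out.println(m.group(5)); // $url
        }
    }
}
```

Building both variants from one helper keeps the comparison honest: the only difference between the string that throws and the string that compiles is the single escaped bracket.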