I'm borrowing a rather complex regex from some PHP Textile implementations (open source, properly attributed) for textile4j, a simple, not quite feature-complete Java implementation that I'm porting to GitHub and syncing to Maven Central. (The original code was written to provide a plugin for blojsom, a Java blogging platform; this is part of a larger effort to make blojsom dependencies available in Maven Central.)
Unfortunately, the Textile regular expressions, while they work in the context of preg_replace_callback in PHP, fail in Java with the following exception:
java.util.regex.PatternSyntaxException: Unclosed character class near index 217
The statement is obvious, the solution is elusive.
Here's the raw, multiline regex from the PHP implementation:
return preg_replace_callback('/
(^|(?<=[\s>.\(])|[{[]) # $pre
" # start
(' . $this->c . ') # $atts
([^"]+?) # $text
(?:\(([^)]+?)\)(?="))? # $title
('.$this->urlch.'+?) # $url
(\/)? # $slash
([^\w\/;]*?) # $post
Cleverly, I got the Textile class to "show me the code" being used in this regex with a simple echo, which resulted in the following, rather long, regular expression:
(^|(?<=[\s>.\(])|[{[])"((?:(?:\([^)]+\))|(?:\{[^}]+\})|(?:\[[^]]+\])|(?:\<(?!>)|(?<!<)\>|\<\>|\=|[()]+(?! )))*)([^"]+?)(?:\(([^)]+?)\)(?="))?":([\w"$\-_.+!*'(),";\/?:@=&%#{}|\^~\[\]`]+?)(\/)?([^\w\/;]*?)([\]}]|(?=\s|$|\)))
I've uncovered a couple of possible areas that could be resulting in parsing errors, using online tools such as RegExr by gskinner and RegexPlanet. However, none of those particulars fix the error.
I suspect that there is a range issue hidden in one of the character classes, or a Unicode order hiding somewhere, but I can't find it.
Any ideas?
I'm also curious why PHP doesn't throw a similar error. For example, using RegExr I found one "passive subexpression" that was poorly handled, but changing it didn't fix the Java exception and didn't alter behavior in PHP. The change is shown below.
In $title, switch the escaped paren, from the first line to the second:
(?:\(([^)]+?)\)(?="))? # $title
(?:(\([^)]+?)\)(?="))? # $title
Thanks, Tim
edit: adding a Java String interpretation (with escapes) of the Textile regex, as determined by RegexPlanet ...
"(^|(?<=[\\s>.\\(])|[{[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)(\\/)?([^\\w\\/;]*?)([\\]}]|(?=\\s|$|\\)))"
I'm not sure exactly where your problem lies, but this might help:
In Java (and I believe this is unique to Java), the [ symbol (not just the ] symbol) is reserved inside character classes and needs to be escaped. The revised expression should probably have that bracket escaped in order to be Java-compatible.
Basically, anywhere most regex flavors would accept a character class like [a-z,;[\]+-], which would match "either a letter from a to z, or a comma, semicolon, open or close square bracket, plus or minus sign", Java needs it written as [a-z,;\[\]+-] (escape the [ with a \ character).
This escaping requirement is due to the Java union, intersection and subtraction character-class constructs.
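To make the union, intersection and subtraction point concrete, here is a minimal sketch using only java.util.regex. The class name CharClassBracketDemo and the helper compiles are made up for illustration; the union and subtraction patterns follow the forms documented for java.util.regex.Pattern, and the last two patterns are the example class from above, unescaped and escaped.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of textile4j.
public class CharClassBracketDemo {

    /** Returns true if the regex compiles under java.util.regex. */
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Union: the nested class [m-p] adds its members to the outer class.
        System.out.println("m".matches("[a-d[m-p]]"));        // true

        // Subtraction via intersection with a negated class: a-z except b and c.
        System.out.println("b".matches("[a-z&&[^bc]]"));      // false

        // Because '[' starts a nested class, an unescaped '[' leaves the
        // outer class unclosed and the pattern is rejected.
        System.out.println(compiles("[a-z,;[\\]+-]"));        // false

        // Escaping the '[' turns it into a literal member of the class.
        System.out.println(compiles("[a-z,;\\[\\]+-]"));      // true
        System.out.println("[".matches("[a-z,;\\[\\]+-]"));   // true
    }
}
```

The same reasoning applies to any class in a ported PCRE pattern: a literal [ inside a character class has to be written as \[ in Java (\\[ in a Java string literal).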
@CodeJockey is correct: there's a square bracket in one of your character classes that needs to be escaped.
Classes such as [^]] are okay because the ] is the first character other than the negating ^, but in Java an unescaped [ anywhere in a character class is a syntax error.
Ironically, the original regex contains many backslashes that aren't needed even in PHP. It also escapes / because that's what it uses as the regex delimiter. After weeding all those out I came up with a cleaned-up Java regex, though whether it's the best regex I have no idea, not knowing how it's being used.
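As a rough sketch of the smallest possible fix (not the weeded-out regex described above), the following takes the Java string from the question's edit and changes only one thing: the [ inside the first character class is escaped, so [{[] becomes [{\[]. The class name TextileLinkRegex, the constant names, and the sample input are assumptions for illustration; the redundant backslashes are deliberately left in place.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder class, not part of textile4j.
public class TextileLinkRegex {

    // The question's pattern, split by the groups named in the PHP comments.
    // Only change: "[{[]" is written as "[{\\[]" in the first line.
    public static final String LINK_REGEX =
          "(^|(?<=[\\s>.\\(])|[{\\[])"                               // pre
        + "\""                                                      // start
        + "((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])"
        + "|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)"        // atts
        + "([^\"]+?)"                                               // text
        + "(?:\\(([^)]+?)\\)(?=\"))?"                               // title
        + "\":"
        + "([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)"     // url
        + "(\\/)?"                                                  // slash
        + "([^\\w\\/;]*?)"                                          // post
        + "([\\]}]|(?=\\s|$|\\)))";                                 // tail

    public static final Pattern LINK = Pattern.compile(LINK_REGEX);

    public static void main(String[] args) {
        // Illustrative Textile link: "text":url
        Matcher m = LINK.matcher("\"Google\":http://google.com");
        if (m.find()) {
            System.out.println(m.group(3)); // link text
            System.out.println(m.group(5)); // url
        }
    }
}
```

Whether the groups behave exactly as they do under preg_replace_callback would still need checking against real Textile input; this only shows that the pattern compiles in Java once the bracket is escaped.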