Unclosed character class near index nnn

Posted 2020-07-16 09:18

Question:

I'm borrowing a rather complex regex from some PHP Textile implementations (open source, properly attributed) for textile4j, a simple, not quite feature-complete Java implementation that I'm porting to GitHub and syncing to Maven Central. The original code was written as a plugin for blojsom, a Java blogging platform; this is part of a larger effort to make blojsom dependencies available in Maven Central.

Unfortunately, the Textile regular expressions, which work with preg_replace_callback in PHP, fail in Java with the following exception:

java.util.regex.PatternSyntaxException: Unclosed character class near index 217

The statement is obvious, the solution is elusive.

Here's the raw, multiline regex from the PHP implementation:

return preg_replace_callback('/
    (^|(?<=[\s>.\(])|[{[]) # $pre
    "                      # start
    (' . $this->c . ')     # $atts
    ([^"]+?)               # $text
    (?:\(([^)]+?)\)(?="))? # $title
    ":
    ('.$this->urlch.'+?)   # $url
    (\/)?                  # $slash
    ([^\w\/;]*?)           # $post
    ([\]}]|(?=\s|$|\)))
    /x',callback,input);

Cleverly, I got the textile class to "show me the code" being used in this regex with a simple echo that resulted in the following, rather long, regular expression:

(^|(?<=[\s>.\(])|[{[])"((?:(?:\([^)]+\))|(?:\{[^}]+\})|(?:\[[^]]+\])|(?:\<(?!>)|(?<!<)\>|\<\>|\=|[()]+(?! )))*)([^"]+?)(?:\(([^)]+?)\)(?="))?":([\w"$\-_.+!*'(),";\/?:@=&%#{}|\^~\[\]`]+?)(\/)?([^\w\/;]*?)([\]}]|(?=\s|$|\)))

I've uncovered a couple of possible areas that could be resulting in parsing errors, using online tools such as RegExr by gskinner and RegexPlanet. However, none of those particulars fix the error.

I suspect that there is a range issue hidden in one of the character classes, or a Unicode order hiding somewhere, but I can't find it.
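
One way to narrow the search is to compile each group on its own and print what `PatternSyntaxException` reports. This is only a minimal sketch: the class `RegexDiagnostics` and its `describe` helper are made-up names for illustration, and the long `$atts` and `$url` groups are left out for brevity.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexDiagnostics {
    /** Returns null if the regex compiles, otherwise the parser's description and index. */
    static String describe(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            // getDescription() is the error text; getIndex() is the approximate
            // position in the pattern where the parser gave up (-1 if unknown).
            return e.getDescription() + " (index " + e.getIndex() + ")";
        }
    }

    public static void main(String[] args) {
        // Groups from the Textile link regex, written as Java string literals.
        String[] pieces = {
            "(^|(?<=[\\s>.\\(])|[{[])",   // $pre
            "([^\"]+?)",                   // $text
            "(?:\\(([^)]+?)\\)(?=\"))?",   // $title
            "(\\/)?",                      // $slash
            "([^\\w\\/;]*?)",              // $post
            "([\\]}]|(?=\\s|$|\\)))"       // closing group
        };
        for (String piece : pieces) {
            String problem = describe(piece);
            System.out.println(piece + " -> " + (problem == null ? "ok" : problem));
        }
    }
}
```

Compiling piece by piece helps because the index reported for the full expression may be far from the actual cause: the parser only notices an unclosed class once it runs out of input.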

Any ideas?

I'm also curious why PHP doesn't throw a similar error. For example, RegExr flagged one poorly handled "passive subexpression", but changing it as shown below neither fixed the Java exception nor altered the behavior in PHP.

In $title, swap the escaped paren and the opening paren of the capture group:

        (?:\(([^)]+?)\)(?="))? # $title
        ...^
        (?:(\([^)]+?)\)(?="))? # $title
        ....^

Thanks, Tim

edit: adding a Java String interpretation (with escapes) of the Textile regex, as determined by RegexPlanet ...

"(^|(?<=[\\s>.\\(])|[{[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)(\\/)?([^\\w\\/;]*?)([\\]}]|(?=\\s|$|\\)))"

Answer 1:

@CodeJockey is correct: there's a square bracket in one of your character classes that needs to be escaped. []] and [^]] are okay because the ] is the first character (other than the negating ^), but in Java an unescaped [ inside a character class opens a nested class. In [{[] that nested class is never closed, which is what triggers the exception.
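
The difference can be checked in isolation with the `$pre` class from the question. This is a small sketch; `BracketEscapeDemo` and `compiles` are illustrative names, not part of textile4j.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BracketEscapeDemo {
    /** True if java.util.regex accepts the pattern. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Unescaped [ opens a nested class that is never closed.
        System.out.println(compiles("[{[]"));            // false
        // Escaped [ is an ordinary literal.
        System.out.println(compiles("[{\\[]"));          // true

        Pattern pre = Pattern.compile("[{\\[]");
        System.out.println(pre.matcher("[").matches());  // true
        System.out.println(pre.matcher("{").matches());  // true
        System.out.println(pre.matcher("]").matches());  // false
    }
}
```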

Ironically, the original regex contains many backslashes that aren't needed even in PHP. It also escapes / because that's what it uses as the regex delimiter. After weeding all those out I came up with this Java regex:

"(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/;]*?)([]}]|(?=\\s|$|\\)))"

Whether it's the best regex I have no idea, not knowing how it's being used.
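
For what it's worth, here is a rough sketch of how the expression above might be applied to a simple Textile link. The class `TextileLinkDemo`, the `firstLink` helper, and the sample input are assumptions for illustration; the group numbers follow the order of the comments in the PHP source ($text is group 3, $url is group 5).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextileLinkDemo {
    // The cleaned-up regex from this answer, copied unchanged.
    static final Pattern LINK = Pattern.compile(
        "(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/;]*?)([]}]|(?=\\s|$|\\)))");

    /** Returns {text, url} for the first link found, or null if there is none. */
    static String[] firstLink(String input) {
        Matcher m = LINK.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(3), m.group(5) };
    }

    public static void main(String[] args) {
        String[] link = firstLink("\"Google\":http://google.com");
        if (link != null) {
            System.out.println("text = " + link[0]);
            System.out.println("url  = " + link[1]);
        }
    }
}
```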



Answer 2:

I'm not sure exactly where your problem lies, but this might help:

In Java (and I believe this is unique to Java), the [ symbol (not just the ] symbol) is reserved inside character classes and needs to be escaped.

The revised expression should probably be similar to the following, in order to be Java-compatible:

(^|(?<=[\s>.\(])|[{\[]) # $pre
"                       # start
(' . $this->c . ')      # $atts
([^"]+?)                # $text
(?:\(([^)]+?)\)(?="))?  # $title
":
('.$this->urlch.'+?)    # $url
(\/)?                   # $slash
([^\w\/;]*?)            # $post
([\]}]|(?=\s|$|\)))
/x

Basically, wherever most regex flavors allow a character class like [a-z,;[\]+-] (which matches a letter a-z, a comma, a semicolon, an open or close square bracket, or a plus or minus sign), Java needs [a-z,;\[\]+-] instead: escape the [ with a \ character.

This escaping requirement is due to the Java union, intersection and subtraction character-class constructs.
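
A short sketch of those three constructs, using class examples in the style of the `java.util.regex.Pattern` documentation. `CharClassSetOps` and `matchesWhole` are illustrative names.

```java
import java.util.regex.Pattern;

class CharClassSetOps {
    /** True if the whole input matches the regex. */
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Union: a nested class adds its members to the outer class.
        System.out.println(matchesWhole("[a-c[x-z]]", "y"));       // true
        System.out.println(matchesWhole("[a-c[x-z]]", "m"));       // false

        // Intersection: && keeps only characters present in both classes.
        System.out.println(matchesWhole("[a-z&&[aeiou]]", "e"));   // true
        System.out.println(matchesWhole("[a-z&&[aeiou]]", "b"));   // false

        // Subtraction: intersect with a negated class.
        System.out.println(matchesWhole("[a-z&&[^aeiou]]", "b"));  // true
        System.out.println(matchesWhole("[a-z&&[^aeiou]]", "e"));  // false
    }
}
```

Because an inner [ has this special meaning, the parser treats it as the start of a nested class rather than as a literal bracket.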