How to properly escape Regular Expression pattern

Posted 2019-09-04 19:40

Question:

I need to fulfill a requirement to only accept values in the form of MM/DD/YYYY.

From what I've read at https://www.w3.org/TR/xmlschema11-2/#nt-dateRep, using

    <xs:simpleType name="DATE">
        <xs:restriction base="xs:date"/>
    </xs:simpleType>

is not going to work, as its regex apparently does not support this format.

I have found and adjusted this format:

^(?:(?:(?:0?[13578]|1[02])(\/)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

To this form:

\^\(\?:\(\?:\(\?:0\?\[13578\]\|1\[02\]\)\(\\/\)31\)\1\|\(\?:\(\?:0\?\[1,3-9\]\|1\[0-2\]\)\(\\/\)\(\?:29\|30\)\2\)\)\(\?:\(\?:1\[6-9\]\|\[2-9\]\d\)\?\d{2}\)$\|\^\(\?:0\?2\(\\/\)29\3\(\?:\(\?:\(\?:1\[6-9\]\|\[2-9\]\d\)\?\(\?:0\[48\]\|\[2468\]\[048\]\|\[13579\]\[26\]\)\|\(\?:\(\?:16\|\[2468\]\[048\]\|\[3579\]\[26\]\)00\)\)\)\)$\|\^\(\?:\(\?:0\?\[1-9\]\)\|\(\?:1\[0-2\]\)\)\(\\/\)\(\?:0\?\[1-9\]\|1\d\|2\[0-8\]\)\4\(\?:\(\?:1\[6-9\]\|\[2-9\]\d\)\?\d{2}\)$

Now I no longer get invalid escaping errors in XML editors (using XML Spy), but I get this one:

invalid-escape: The given character escape is not recognized.

I escaped the pattern according to the XML Schema specification at https://www.w3.org/TR/xmlschema-2/#regexs, where Section F.1.1 has an escape table.

Can anyone please help to nail this down right?

Thanks!

Answer 1:

If you check the XSD regex syntax resources, you will notice that there is no support for non-capturing groups ((?:...)) or for backreferences (entities such as \1 or \2 that refer to the text captured by a capturing group, (...)).

Since the only delimiter is /, you can get rid of the backreference completely.

Use

((((0?[13578]|1[02])/31)/|((0?[13-9]|1[0-2])/(29|30)/))((1[6-9]|[2-9]\d)?\d{2})|(0?2/29/(((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))|(0?[1-9]|1[0-2])/(0?[1-9]|1\d|2[0-8])/(1[6-9]|[2-9]\d)?\d{2})

Note that, according to regular-expressions.info:

Particularly noteworthy is the complete absence of anchors like the caret and dollar, word boundaries, and lookaround. XML schema always implicitly anchors the entire regular expression. The regex must match the whole element for the element to be considered valid.

So, you should not use ^ (start of string) and $ (end of string) in XSD regex.

The / symbol is escaped in regex flavors where it is a regex delimiter, and in XSD regex, there are no regex delimiters (as the only action is matching, and there are no modifiers: XML schemas do not provide a way to specify matching modes). So, do not escape / in XSD regex.

NOTE ON TESTING WITH ONLINE TESTERS

If you test at regex101.com or similar sites, note that in most cases you need to escape the / if it is selected as the regex delimiter. You can safely remove the \ before each / once you have finished testing.
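
For a quick check outside an XSD validator, one option is Python's re.fullmatch, which only succeeds when the whole string matches and so approximates XSD's implicit anchoring. This is a sketch only: Python's re is a different regex flavor from XSD, and the helper name is_valid_date and the sample values are assumptions made for illustration.

```python
import re

# The XSD pattern from the answer above: no ^/$ anchors, no escaped slashes.
XSD_DATE_PATTERN = (
    r"((((0?[13578]|1[02])/31)/|((0?[13-9]|1[0-2])/(29|30)/))"
    r"((1[6-9]|[2-9]\d)?\d{2})"
    r"|(0?2/29/(((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])"
    r"|((16|[2468][048]|[3579][26])00))))"
    r"|(0?[1-9]|1[0-2])/(0?[1-9]|1\d|2[0-8])/(1[6-9]|[2-9]\d)?\d{2})"
)


def is_valid_date(value: str) -> bool:
    """Return True only if the entire string matches the pattern.

    re.fullmatch stands in for XSD's implicit anchoring, so the pattern
    itself needs no ^ or $.
    """
    return re.fullmatch(XSD_DATE_PATTERN, value) is not None


if __name__ == "__main__":
    samples = ["01/31/2020", "04/31/2020", "02/29/2020", "02/29/2019", "2019-12-25"]
    for sample in samples:
        print(sample, is_valid_date(sample))
```

With re.search instead of re.fullmatch, a value such as x12/25/2019x would be accepted because the date appears somewhere inside it, which is not how an XSD pattern facet behaves.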



Answer 2:

OK, so you're starting from this (I'm going to insert newlines for readability):

    ^(?:(?:(?:0?[13578]|1[02])(\/)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$
    |^(?:0?2(\/)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$
    |^(?:(?:0?[1-9])|(?:1[0-2]))(\/)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

Horrendous stuff. Now, in XSD:

(a) there are no ^ and $ anchors; they aren't needed, because the pattern is implicitly anchored. So take them out. You've responded by escaping them as \^ and \$, but that doesn't make sense: you don't actually want circumflexes and dollar signs in your input.

(b) XSD doesn't recognize non-capturing groups (?:xxxx). Just replace them with capturing groups, that is, remove the ?: from each one. Again, you've escaped the question marks, which doesn't make any sense at all.

(c) The \d should probably be [0-9], unless you actually want to match non-ASCII digits (e.g. Thai or Eastern Arabic digits).

(d) Slash (/) doesn't need to be escaped, and indeed can't be escaped. So replace \/ with /.

(e) I see some back-references: \1, \2, \3, \4. XSD regexes do not allow back-references. But as far as I can see, the back-references in this regex serve no useful purpose. They all seem to be back-references to a group of the form (\/), which can only match a single slash, so each back-reference can simply be replaced with /. Maybe they are throwbacks to some earlier form of the regex that allowed alternative delimiters but required them to be consistent.
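
Steps (a) to (e) are mechanical enough to sketch in code. The Python snippet below applies them to the original pattern with plain string replacements. It is a rough illustration that works for this particular pattern only (for instance, it assumes no ^ or $ appears inside a character class), and the function name to_xsd_pattern is hypothetical. It also drops the stray comma in [1,3-9]; that is not one of the steps above, but the comma would otherwise let a literal "," pass as a month.

```python
import re

# Original PCRE-style pattern from the question (anchors, non-capturing
# groups, escaped slashes, back-references).
ORIGINAL = (
    r"^(?:(?:(?:0?[13578]|1[02])(\/)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/)(?:29|30)\2))"
    r"(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
    r"|^(?:0?2(\/)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])"
    r"|(?:(?:16|[2468][048]|[3579][26])00))))$"
    r"|^(?:(?:0?[1-9])|(?:1[0-2]))(\/)(?:0?[1-9]|1\d|2[0-8])\4"
    r"(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
)


def to_xsd_pattern(pattern: str) -> str:
    """Apply steps (a)-(e) to this specific pattern (not a general converter)."""
    p = pattern.replace("^", "").replace("$", "")  # (a) drop the anchors
    p = p.replace("(?:", "(")                      # (b) plain groups only
    p = p.replace("(\\/)", "/")                    # (d) unescape the slash; its group is no longer needed
    p = re.sub(r"\\[1-9]", "/", p)                 # (e) each back-reference becomes "/"
    p = p.replace("\\d", "[0-9]")                  # (c) ASCII digits only
    p = p.replace("[1,3-9]", "[13-9]")             # extra fix: stray comma in the class
    return p


if __name__ == "__main__":
    xsd = to_xsd_pattern(ORIGINAL)
    print(xsd)
    # re.fullmatch approximates XSD's implicit anchoring of the whole value.
    for sample in ["02/29/2020", "02/29/2019", "12/25/2019"]:
        print(sample, re.fullmatch(xsd, sample) is not None)
```

The result contains no backslashes at all, which is a reasonable sanity check given that the reported error was about an unrecognized character escape.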

From your attempts to fix the problems here, it seems to me that you don't have a very thorough understanding of regular expressions. I fear that to get this working, you are going to have to bite the bullet and learn how it works; debugging complex regular expressions is difficult, and you won't get it right by trial and error.