-->

What is the regular expression for the set of stri

Posted 2019-05-31 09:07

Question:

I want to write an XSD that restricts the content of valid XML elements of type xsd:token so that, at validation, they would be indistinguishable from the same content wrapped in xsd:string.

That is, they do not contain carriage return (#xD), line feed (#xA), or tab (#x9) characters, do not begin or end with a space (#x20) character, and do not include a sequence of two or more adjacent space characters.

I think the regular expression to use is this:

\S+( \S+)*

(one or more non-whitespace characters, optionally followed by groups of a single space plus one or more non-whitespace characters, so the value always ends with non-whitespace)

This works in various regex testing tools, but I can't seem to verify it using oXygen XML Editor: XML instances whose strings contain double spaces, leading or trailing spaces, tabs, or line breaks still pass validation.

Here's the XSD implementation:

<xs:simpleType name="Tokenized500Type">
  <xs:restriction base="xs:token">
    <xs:maxLength value="500"/>
    <xs:minLength value="1"/>
    <xs:pattern value="\S+( \S+)*"/>
  </xs:restriction>
</xs:simpleType>

Is there some feature of XML, XSD, or oXygen XML Editor that prevents this from working?

Answer 1:

Your original ([^\s])+( [^\s]+)*([^\s])* regex contains some redundant patterns: it matches and captures each iteration of 1+ non-whitespaces, then matches 0+ sequences of space and 1+ non-whitespaces, and then again tries to match and capture each iteration of a non-whitespace.

You may use a similar, but shorter

\S+( \S+)*

Since XML Schema regex is anchored by default, the expression matches:

  • \S+ - one or more chars other than whitespace, specifically &#x20; (space), \t (tab), \n (newline) and \r (return)
  • ( \S+)* - zero or more sequences of a single space followed by 1+ non-whitespace chars.

This expression disallows consecutive spaces as well as leading and trailing spaces.
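As a rough check of the breakdown above, the pattern can be tried outside an XSD validator. The following is only a sketch in Python, not XML Schema itself: Python's \S excludes more characters than the XSD \S, so the character class is spelled out explicitly, and re.fullmatch stands in for the implicit anchoring. The names NON_WS, PATTERN and matches_token_pattern are made up for this illustration.

```python
import re

# XSD's \S means "anything except space, tab, LF, CR"; Python's \S is
# broader, so the class is written out explicitly (an approximation).
NON_WS = r"[^ \t\n\r]"
PATTERN = re.compile(rf"{NON_WS}+( {NON_WS}+)*")


def matches_token_pattern(value: str) -> bool:
    """Hypothetical helper: True if the whole value matches the pattern."""
    # fullmatch mimics the implicit anchoring of XML Schema patterns
    return PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["a b c", " a", "a ", "a  b", "a\tb", "a\nb", ""]:
        print(repr(sample), matches_token_pattern(sample))
```

Only the first sample is accepted; the others are rejected for a leading space, a trailing space, a double space, a tab, a line feed, and an empty value respectively.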

Here is how the regex should be used:

<xs:simpleType name="Tokenized500Type">
  <xs:restriction base="xs:string">
    <xs:pattern value="\S+( \S+)*"/>
    <xs:maxLength value="500"/>
    <xs:minLength value="1"/>
  </xs:restriction>
</xs:simpleType>


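Answer 2:

The base type needs to be xsd:string.

With xsd:token as the base, the validator first tokenizes the input (xsd:token has its whiteSpace facet set to collapse, so tabs and line breaks become spaces, runs of spaces are collapsed, and leading and trailing spaces are stripped) and only THEN checks the pattern. The pattern therefore always sees an already-normalized value and never rejects anything because of whitespace. With xsd:string, whitespace is preserved and the pattern is checked against the value as written.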
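To make the ordering concrete, here is a small Python sketch that imitates the two steps. It is an approximation under stated assumptions, not how any particular validator is implemented: collapse imitates whiteSpace="collapse", and validate is a hypothetical helper whose base argument is either "token" or "string".

```python
import re

NON_WS = r"[^ \t\n\r]"  # explicit stand-in for the XSD \S class
PATTERN = re.compile(rf"{NON_WS}+( {NON_WS}+)*")


def collapse(value: str) -> str:
    """Approximate whiteSpace="collapse", which xs:token applies."""
    replaced = re.sub(r"[\t\n\r]", " ", value)  # tab/LF/CR become spaces
    return re.sub(r" +", " ", replaced).strip(" ")  # squeeze runs, trim ends


def validate(value: str, base: str) -> bool:
    """Hypothetical helper: normalize per base type, then check the pattern."""
    normalized = collapse(value) if base == "token" else value
    return PATTERN.fullmatch(normalized) is not None


if __name__ == "__main__":
    messy = "  a \t b\n"
    print(validate(messy, "token"))   # normalized to "a b" before matching
    print(validate(messy, "string"))  # matched as written
```

With base "token" the messy value is accepted because it is cleaned up before the pattern runs; with base "string" the same value is rejected, which is the behavior the question asks for.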