How to rewrite this nondeterministic XML Schema to

Posted 2019-08-02 16:38

Question:

Why is this schema non-deterministic, and how can I fix it?

    <xs:element name="activeyears">
        <xs:complexType>
            <xs:sequence minOccurs="0" maxOccurs="1">
                <xs:sequence minOccurs="0" maxOccurs="unbounded">
                    <xs:element ref="from" minOccurs="1" maxOccurs="1"/>
                    <xs:element ref="till" minOccurs="1" maxOccurs="1"/>
                </xs:sequence>
                <xs:element ref="from" minOccurs="0" maxOccurs="1"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

It is supposed to mean that <activeyears> is either empty or contains an alternating sequence of <from> and <till> elements that starts with <from> but can end with either.

Answer 1:

A schema is non-deterministic when there are two branches that begin with the same element - so that you cannot tell which branch to take without looking ahead after that element. A simple example is ab|ac - when you see an a, you don't know which branch to take. For loops, the "branch" is whether to repeat the loop, or continue after it. An example of this is a*a - once you are in the loop, and you read an a, you don't know whether to repeat the loop, or continue.
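
To make the ab|ac case concrete, here is a minimal sketch as XML Schema fragments. The element names `a`, `b`, `c` and the wrappers `ambiguous` and `factored` are hypothetical, and `a`, `b`, `c` are assumed to be declared globally elsewhere in the schema. The first content model should be rejected under the Unique Particle Attribution rule because both branches of the choice begin with `a`; the second accepts the same documents with the shared `a` factored out of the choice.

```xml
<!-- Non-deterministic (ab|ac): both branches of the choice begin with <a> -->
<xs:element name="ambiguous">
    <xs:complexType>
        <xs:choice>
            <xs:sequence>
                <xs:element ref="a"/>
                <xs:element ref="b"/>
            </xs:sequence>
            <xs:sequence>
                <xs:element ref="a"/>
                <xs:element ref="c"/>
            </xs:sequence>
        </xs:choice>
    </xs:complexType>
</xs:element>

<!-- Deterministic equivalent a(b|c): the choice is made only after <a> -->
<xs:element name="factored">
    <xs:complexType>
        <xs:sequence>
            <xs:element ref="a"/>
            <xs:choice>
                <xs:element ref="b"/>
                <xs:element ref="c"/>
            </xs:choice>
        </xs:sequence>
    </xs:complexType>
</xs:element>
```

The a*a loop can be repaired in a similar spirit by writing it as aa*, which in a schema is simply a single `a` particle with minOccurs="1" and maxOccurs="unbounded".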

Looking at your example schema, imagine that the validator has just parsed a <till> and the next element is a <from>. It could match that <from> against the <from><till> loop or against the final optional <from>. It can't tell which branch to use just by looking at that <from>; it could only tell by looking further ahead.


Bad news: I think your example is one of the very rare cases that are impossible to express deterministically!

Here are the XML documents you want to accept (I'm using a single letter for each element, where a = <from>...</from> and b = <till>...</till>):

*empty*
a
ab
aba
abab
ababa
ababab
...

... you get the idea. The problem is that any letter can be the final letter in the sequence or it can be part of the loop. There is no way to tell which it will be, except by looking-ahead at the following letter. Since "deterministic" means that you don't do this lookahead (by definition), the language that you want cannot be expressed deterministically.

Simplifying your schema, it tries an approach similar to (ab)*a? - but both branches start with a. Another approach is a(ba)*b? - now both branches start with b. We can't win!
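
As a sketch, the second approach a(ba)*b? can be written out in the question's own vocabulary. The outer sequence is made optional here to allow the empty case, and `from` and `till` are assumed to be declared globally as in the question. The fragment is shown only to illustrate the conflict: after a `from`, a `till` could belong either to the loop or to the trailing optional particle, so a conforming processor is expected to reject this content model just like the original.

```xml
<!-- Still non-deterministic: a <till> after the first <from> matches
     both the loop's <till> and the trailing optional <till> -->
<xs:element name="activeyears">
    <xs:complexType>
        <xs:sequence minOccurs="0" maxOccurs="1">
            <xs:element ref="from"/>
            <xs:sequence minOccurs="0" maxOccurs="unbounded">
                <xs:element ref="till"/>
                <xs:element ref="from"/>
            </xs:sequence>
            <xs:element ref="till" minOccurs="0" maxOccurs="1"/>
        </xs:sequence>
    </xs:complexType>
</xs:element>
```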

Technically, the set of all documents that a schema will accept is called that schema's language. If no deterministic schema exists that can express a language, the language is called "one-ambiguous" (that is, it is not "one-unambiguous").

For a theoretical discussion, see the series of papers by Brüggemann-Klein (e.g. "Deterministic Regular Languages" and "One-Unambiguous Regular Languages"). She includes a formal test for one-unambiguous languages.



Answer 2:

This is a simple edit of your code; I haven't tried it:

    <xs:element name="activeyears">
        <xs:complexType>
            <xs:sequence minOccurs="0" maxOccurs="1">
                <xs:element ref="from" minOccurs="1" maxOccurs="1"/>
                <xs:sequence minOccurs="0" maxOccurs="unbounded">
                    <xs:element ref="till" minOccurs="1" maxOccurs="1"/>
                    <xs:element ref="from" minOccurs="0" maxOccurs="1"/>
                </xs:sequence>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

Some background: XML schema is a very simple grammar, and the schema processor is a parser that attempts to apply the rules of this grammar to the input file. Unlike the parsers used by traditional compilers, however, XML schema has no lookahead. So you can't have two rules that share the same initial set of tokens (element names).

So, the specific changes that I made:

  • I left your outer sequence unchanged; it controls the "empty or has specific content" requirement.
  • If there is content, it must start with "from", so I made that the first element in the sequence, with an explicit occurrence count.
  • Since I used "from" as an explicit element, I had to reverse the order of the subsequence.
  • And unless you want to specify that every "till" must be followed by a "from", you need to relax the minOccurs in the subsequence.
  • The subsequence also handles the case of a single from/till -- as a commenter noted, my second edit with the minOccurs='0' allowed a terminating sequence of two "till"s.
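
To make the caveat about consecutive "till"s concrete: read as the pattern from (till from?)*, the content model shown in this answer also accepts two or more <till> elements in a row. Below is a hypothetical instance, assuming `from` and `till` hold simple year values (their declarations are not shown in the question, so the values are purely illustrative).

```xml
<!-- Accepted by the rewritten content model, although the intended
     from/till alternation does not allow two <till> elements in a row -->
<activeyears>
    <from>1990</from>
    <till>1995</till>
    <till>1999</till>
</activeyears>
```

So the rewrite trades exactness for determinism: it accepts every document the original was meant to accept, plus some extra ones like this. If strict alternation matters, it would have to be checked outside the content model.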