What is the reason for not allowing non-determinis

The following declaration:

<!ELEMENT p ((b, a) | (b, c))>

and its XSD equivalent are both invalid because they are not deterministic, according to validators and a quick check of the spec(s). However, since every non-deterministic finite automaton has an equivalent deterministic finite automaton and since there are algorithms for converting NFAs into DFAs, what is the reason for prohibiting non-deterministic declarations?

回答1:

There are two classes of possible answer to a question like this: technical and historical.

There is no sound technical reason for the determinism rules of XML DTDs or XSD.

It is because of the determinism rule that the intersection and union of two XSD content models are not guaranteed to be describable with a legal XSD content model. It is because of the determinism rule that some regular languages cannot be expressed as content models. It is because of the determinism rule that abstract types and substitution groups in XSD fall so far short of their potential in making vocabularies easily extensible. In short, the determinism rule contributes nothing to SGML, XML DTDs, or XSD but pointless complication (and, in the cases of SGML and XSD, crackpot termininology: ambiguity that is actually not ambiguity, and unique particle attribution instead of strong determinism -- why use five syllables when you can use nine?).

That leaves the historical answer.

I preface these remarks by saying that for my part in this sorry history I apologize to those who care about these things. I tried to persuade the WG to get rid of the determinism rules in 1996 when we were designing XML DTDs, and I failed. I tried to persuade the XML Schema WG not to adopt them in 1998-2001 when we were designing XSD 1.0, and I failed. I tried to persuade that WG to get rid of them in 2001-2012 when XSD 1.1 was gradually (very, very gradually) coming into being, and I failed again. Sorry; I did try.

Those in the SGML working group which originally created the rule (in ISO 8879) and those in the XML and XML Schema working groups who voted to retain the rule have offered a variety of rationalizations, when over the years I have asked them.

Some in the XML WG argued that the determinism rules offer a useful simplification of the parser-writer's task. Cynics have repeatedly suggested (in private) that the rule arose originally because influential members of the SGML WG couldn't figure out how to make backtracking work correctly; others in that WG have hotly denied the claim. (Interestingly, my recollection is that in the XML WG those who argued for determinism as simplifying the parser's task included James Clark, who criticizes the determinism rule in the message cited by chris in his answer to this question. I wish he had changed his mind before it was too late.) Others in the WG thought the determinism rule was bogus, but that we couldn't change it without an unacceptable incompatibility with SGML.

In the XML Schema WG, some WG members tell me they thought determinism was a good idea because it would mean XSD validators could use the content-model validation code base from existing DTD validators. (Truth! This suggests a rather touching faith in code reuse, or possibly excessive consumption of hallucinogens.) At the time, many WG members gave me the impression that they didn't feel they understood the issue clearly enough to have an independent view and thought that backward compatibility with DTDs would be safer than making a change which might have consequences they could not foresee. It would be easy (they thought) to change later if it proved unnecessary or unhelpful. (Later, during the work on XSD 1.1, vendors resisted any attempt to eliminate the determinism rule as if they were repelling an attack on their virtue. So much for "We can always relax the constraint later".)

Some people (both in the SGML and in the XSD WGs) have suggested the determinism rule is useful because it allows annotations to be associated with particular positions in the content model. In the XSD case, this strikes me as fighting a lost battle -- it is just as easy to annotate based on position in the instance, and the existing XPath infrastructure and mindshare makes that a far preferable course. In the SGML case, that argument doesn't apply since XPath hadn't yet been invented, but in SGML annotations are not allowed on individual particles of a content model, so the idea was a non-starter in any case.

This idea survives, though, in the ability of XSD schema authors to add xs:annotation elements to individual particles in a content model. I have been asking for fourteen years or so now without finding anyone who has used this facility in a production schema, anyone who has used this facility in an experimental test schema, or anyone who has heard of anyone using this facility in any kind of schema at all. Nor have I found anyone able to provide a coherent account of a concrete application in which it would be helpful. (They, in response, have pointed out to me that I have never provided an ironclad proof that it could never ever under any circumstances be helpful, in triplicate with endorsements from five full professors of software engineering. They're right; I guess I'm just lazy. But I have never understood why the bar for including stupidities in XSD was so low, and the bar for eliminating them so high.)

The only halfway plausible argument I have ever heard came from a few (very few) thoughtful WG members (very few -- well, one) who made clear that they understood the technical issues perfectly well, but that in the transaction-server deployment scenarios they had in mind, validation speed was essential even in situations where schema pre-compilation and caching would be infeasible. So they wanted to retain the determinism constraint so as to avoid (a) the cost of validating with an NFA instead of a DFA, and (b) the quadratic cost of determinizing an NFA. I don't actually think this is a compelling argument (why should schema caching be impossible for a transaction server, for heaven's sake?), but the person who made it undeniably knows more about transaction servers than I do.

In sum: SGML invented the determinism rule for reasons lost in the mists of time; no two members of the WG tell the same story, and I conclude that there was no consensus there on the reason to have the rule, only a lot of individual reasons. XML DTDs retained the rule for compatibility with SGML (it was not enough that legal SGML DTDs be legal XML, we wanted all legal XML DTDs to be legal SGML -- the WG was made up of SGML people). And XSD got the determinism rule from XML DTDs and then retained it for no particular reasons but fear, uncertainty, and doubt.

Sigh.

回答2:

Apparently the only justification is SGML compatibility: http://lists.w3.org/Archives/Public/xml-editor/2001JanMar/0011.html