I need to verify that a "simplified" DTD is really a subset of a larger DTD, I.e. that documents that are valid according to the "simplified" DTD will also always be valid according to the larger (or "master") DTD.
The simplified DTD is being written now -- it's derived from the master DTD (were it the other way around, one could simply include the smaller DTD into the larger one).
How would I be able to determine if the simplified DTD is derived from the master DTD?
DTDs are really just context-free grammars in disguise. A grammar G represents the set of possible legal strings that comprise the unstated language L(G) that grammar represents.
What you are asking is tantamount to determining if you have G1 and G2, whether L(G1) is a subset of L(G2). My language theory is getting rusty and I don't remember if this is computable in general or not, but my guess this is really hard, because you have to demonstrate that an arbitrary derivation in G1 always has a derivation in G2.
You might be able to answer the question of whether G1 is structured in such a way that you can demonstrate that L(G1) is a subset of L(G2) by demonstrating that each element of G1 is compatible with each element of G2, essentialy by showing that each grammar rule
in G1 has a corresponding rule in G2 with elements dropped. Your idea of diffing DTDs seems to be along this line, with the proviso that if the the diffs are large you are stuck with the general problem rather than the simpler one. At least the way you've characterized the problem (G2 is derived from the master DTD) I think you have chance.
The purpose of the diff would be to identify compatible rules by finding the least differences.
If you have grammar rule g2 = A ; and another g1 = A that you'd claim are related and
that you'd like to check,
you'd first have to demonstrate that the string tokens that A derived in G1 is a superset
of the string of tokens A derived in G2. This looks just like the original unconstrained problem of comparing two langauges; we're now just comparing the sublanguages for the two rules g1 and g2.
So now I think you have to insist that each subrule reachable by g1 is compatible structurally with a corresponding subrule in g2 to make this practical.
I think you can probably write a recursive procedure to check this. What this procedure mostly needs as help is all the set operators (FirstOf, ..) that you tend to find in an LALR parser generator.
On a different front, my company makes Smart Differencer tools, that compute deltas over language constructs in terms of langauge elements and editing operations on those elements. It is parameterized by language definitions. The SmartDifference presently works for a variety of conventional languages (C, C++, C#, COBOL, Java, PHP, Python, ....). XML (and DTDs) are a language, too, for which we have a language definition, and we've built an experimental XML Smart Differencer tools. It ought to work on DTDs just fine. Contact me offline (see bio) if you have further direct interest.
EDIT: Just for grins, I tried the following two DTDs, one derived from the other:
orderform.xml:
<?xml version='1.0' ?>
<!DOCTYPE orderform [
<!ELEMENT orderform (name,company,address,items) >
<!ELEMENT name ( firstname, lastname )>
<!ELEMENT firstname ( #PCDATA )>
<!ELEMENT lastname ( #PCDATA )>
<!ELEMENT company ( #PCDATA )>
<!ELEMENT address ( street, city, country )>
<!ELEMENT street ( #PCDATA )>
<!ELEMENT city( #PCDATA )>
<!ELEMENT country ( zipcode | nation )>
<!ELEMENT zipcode ( #PCDATA )>
<!ELEMENT nation ( #PCDATA )>
<!ELEMENT items (item)+ >
<!ELEMENT item ( partnumber, quantity, unitprice)>
<!ELEMENT partnumber ( #PCDATA )>
<!ELEMENT quantity ( #PCDATA )>
<!ELEMENT unitprice ( #PCDATA )>
]>
<done/>
and orderform2.xml:
<?xml version='1.0' ?>
<!DOCTYPE orderform [
<!ELEMENT orderform (name,company,location,item) >
<!ELEMENT name ( firstname, lastname )>
<!ELEMENT firstname ( #PCDATA )>
<!ELEMENT lastname ( #PCDATA )>
<!ELEMENT company ( #PCDATA )>
<!ELEMENT location ( street, city, country )>
<!ELEMENT street ( #PCDATA )>
<!ELEMENT city( #PCDATA )>
<!ELEMENT country ( zipcode | nation )>
<!ELEMENT zipcode ( #PCDATA )>
<!ELEMENT nation ( #PCDATA )>
<!ELEMENT item ( partnumber, unitprice)>
<!ELEMENT partnumber ( #PCDATA )>
<!ELEMENT quantity ( #PCDATA )>
<!ELEMENT unitprice ( #PCDATA )>
]>
<done/>
[See if you can spot the differences yourself, first :-)
And ran the XML SmartDifferencer:
C:\DMS\Domains\XML\Analyzers\SmartDifferencer\Source>DMSSmartDifferencer XML -SuppressSourceCodeForRenamings C:\DMS\Domains\XML\Tool
s\DTD2COBOL\orderform.xml C:\DMS\Domains\XML\Tools\DTD2COBOL\orderform2.xml
Copyright (C) 2009 Semantic Designs; All Rights Reserved
XML SmartDifferencer Version 1.1.1
Copyright (C) 2009 Semantic Designs, Inc; All Rights Reserved; SD Confidential
Powered by DMS (R) Software Reengineering Toolkit
*** Unregistered SmartDifferencer Version 1.1
*** Operating with evaluation limits.
*** Parsing file C:/DMS/Domains/XML/Tools/DTD2COBOL/orderform.xml ...
*** Parsing file C:/DMS/Domains/XML/Tools/DTD2COBOL/orderform2.xml ...
*** Creating suffix tree ...
*** Determining maximal pairs ...
*** Sorting maximal pairs ...
*** Determining differences ...
*** Printing edits ...
Rename 4.1-9.44 to 4.1-9.45 with 'address'->'location' and 'items'~>'item'
Delete 15.1-15.25 merging 15.18-15.21 into 4.44-4.47
<<!ELEMENT items (item)+ >
Delete 16.30-16.38 merging 16.30-16.38 into 15.18-15.28 with 'quantity'~>'partnumber'
< quantity,
Yep, that's what I did to get the derived one. (The notation N.M means "line N, column M").