可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

According to the lxml documentation "The DTD is retrieved automatically based on the DOCTYPE of the parsed document. All you have to do is use a parser that has DTD validation enabled."

http://lxml.de/validation.html#validation-at-parse-time

However, if you want to validate against an XML schema, you need to explicitly reference one.

I am wondering why this is and would like to know if there is a library or function that can do this. Or even an explanation of how to make this happen myself. The problem is there seems to be many ways to reference an XSD and I need to support all of them.

Validation is not the issue. The issue is how to determine the schemas to validate against. Ideally this would handle inline schemas as well.

Update:

Here is an example.

simpletest.xsd:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="name" type="xs:string"/>
</xs:schema>

simpletest.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<name xmlns="http://www.example.org"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.example.org simpletest.xsd">foo</name>

I would like to do something like the following:

>>> parser = etree.XMLParser(xsd_validation=True)
>>> tree = etree.parse("simpletest.xml", parser)

回答1:

I have a project that has over 100 different schemas and xml trees. In order to manage all of them and validate them i did a few things.

1) I created a file (i.e. xmlTrees.py) where i created a dictionary of every xml and corresponding schema associated with it, and the xml path. This allowed me to have a single place to get both xml & the schema used to validate that xml.

MY_XML = {'url':'/pathToTree/myTree.xml', 'schema':'myXSD.xsd'}

2) In the project we have equally as many namespaces (very hard to manage). So what i did was again i created a single file that contained all the namespaces in the format lxml likes. Then in my tests and scripts i would just always pass the superset of namespaces.

ALL_NAMESPACES = {
    'namespace1':  'http://www.example.org',
    'namespace2':  'http://www.example2.org'
}

3) For basic/generic validation i ended up creating a basic function i could call:

    def validateXML(content, schemaContent):

    try:
        xmlSchema_doc = etree.parse(schemaContent);
        xmlSchema = etree.XMLSchema(xmlSchema_doc);
        xml = etree.parse(StringIO(content));
    except:
        logging.critical("Could not parse schema or content to validate xml");
        response['valid'] = False;
        response['errorlog'] = "Could not parse schema or content to validate xml";

    response = {}
    # Validate the content against the schema.
    try:
        xmlSchema.assertValid(xml)
        response['valid'] = True
        response['errorlog'] = None
    except etree.DocumentInvalid, info:
        response['valid'] = False
        response['errorlog'] = xmlSchema.error_log

    return response

basically any function that wants to use this needs to send the xml content and the xsd content as strings. This provided me with the most flexability. I then just placed this function in a file where i had all my xml helper functions.

回答2:

You could extract the schemas yourself and import them into a root schema:

from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"
XS = '{http://www.w3.org/2001/XMLSchema}'


SCHEMA_TEMPLATE = """<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns="http://dummy.libxml2.validator"
targetNamespace="http://dummy.libxml2.validator"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="1.0"
elementFormDefault="qualified"
attributeFormDefault="unqualified">
</xs:schema>"""


def validate_XML(xml):
    """Validate an XML file represented as string. Follow all schemaLocations.

    :param xml: XML represented as string.
    :type xml: str
    """
    tree = etree.XML(xml)
    schema_tree = etree.XML(SCHEMA_TEMPLATE)
    # Find all unique instances of 'xsi:schemaLocation="<namespace> <path-to-schema.xsd> ..."'
    schema_locations = set(tree.xpath("//*/@xsi:schemaLocation", namespaces={'xsi': XSI}))
    for schema_location in schema_locations:
        # Split namespaces and schema locations ; use strip to remove leading
        # and trailing whitespace.
        namespaces_locations = schema_location.strip().split()
        # Import all found namspace/schema location pairs
        for namespace, location in zip(*[iter(namespaces_locations)] * 2):
            xs_import = etree.Element(XS + "import")
            xs_import.attrib['namespace'] = namespace
            xs_import.attrib['schemaLocation'] = location
            schema_tree.append(xs_import)
    # Contstruct the schema
    schema = etree.XMLSchema(schema_tree)
    # Validate!
    schema.assertValid(tree)

BTW, your simpletest.xsd is missing the targetNamespace.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.org" elementFormDefault="qualified">
    <xs:element name="name" type="xs:string"/>
</xs:schema>

With the code above, your example document validates against this schema.