According to the lxml documentation "The DTD is retrieved automatically based on the DOCTYPE of the parsed document. All you have to do is use a parser that has DTD validation enabled."
http://lxml.de/validation.html#validation-at-parse-time
However, if you want to validate against an XML schema, you need to explicitly reference one.
I am wondering why this is and would like to know if there is a library or function that can do this. Or even an explanation of how to make this happen myself. The problem is there seems to be many ways to reference an XSD and I need to support all of them.
Validation is not the issue. The issue is how to determine the schemas to validate against. Ideally this would handle inline schemas as well.
Update:
Here is an example.
simpletest.xsd:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xs:element name="name" type="xs:string"/>
</xs:schema>
simpletest.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<name xmlns="http://www.example.org"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.example.org simpletest.xsd">foo</name>
I would like to do something like the following:
>>> parser = etree.XMLParser(xsd_validation=True)
>>> tree = etree.parse("simpletest.xml", parser)
I have a project that has over 100 different schemas and xml trees. In order to manage all of them and validate them i did a few things.
1) I created a file (i.e. xmlTrees.py) where i created a dictionary of every xml and corresponding schema associated with it, and the xml path. This allowed me to have a single place to get both xml & the schema used to validate that xml.
MY_XML = {'url':'/pathToTree/myTree.xml', 'schema':'myXSD.xsd'}
2) In the project we have equally as many namespaces (very hard to manage). So what i did was again i created a single file that contained all the namespaces in the format lxml likes. Then in my tests and scripts i would just always pass the superset of namespaces.
ALL_NAMESPACES = {
'namespace1': 'http://www.example.org',
'namespace2': 'http://www.example2.org'
}
3) For basic/generic validation i ended up creating a basic function i could call:
def validateXML(content, schemaContent):
try:
xmlSchema_doc = etree.parse(schemaContent);
xmlSchema = etree.XMLSchema(xmlSchema_doc);
xml = etree.parse(StringIO(content));
except:
logging.critical("Could not parse schema or content to validate xml");
response['valid'] = False;
response['errorlog'] = "Could not parse schema or content to validate xml";
response = {}
# Validate the content against the schema.
try:
xmlSchema.assertValid(xml)
response['valid'] = True
response['errorlog'] = None
except etree.DocumentInvalid, info:
response['valid'] = False
response['errorlog'] = xmlSchema.error_log
return response
basically any function that wants to use this needs to send the xml content and the xsd content as strings. This provided me with the most flexability. I then just placed this function in a file where i had all my xml helper functions.
You could extract the schemas yourself and import them into a root schema:
from lxml import etree
XSI = "http://www.w3.org/2001/XMLSchema-instance"
XS = '{http://www.w3.org/2001/XMLSchema}'
SCHEMA_TEMPLATE = """<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns="http://dummy.libxml2.validator"
targetNamespace="http://dummy.libxml2.validator"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="1.0"
elementFormDefault="qualified"
attributeFormDefault="unqualified">
</xs:schema>"""
def validate_XML(xml):
"""Validate an XML file represented as string. Follow all schemaLocations.
:param xml: XML represented as string.
:type xml: str
"""
tree = etree.XML(xml)
schema_tree = etree.XML(SCHEMA_TEMPLATE)
# Find all unique instances of 'xsi:schemaLocation="<namespace> <path-to-schema.xsd> ..."'
schema_locations = set(tree.xpath("//*/@xsi:schemaLocation", namespaces={'xsi': XSI}))
for schema_location in schema_locations:
# Split namespaces and schema locations ; use strip to remove leading
# and trailing whitespace.
namespaces_locations = schema_location.strip().split()
# Import all found namspace/schema location pairs
for namespace, location in zip(*[iter(namespaces_locations)] * 2):
xs_import = etree.Element(XS + "import")
xs_import.attrib['namespace'] = namespace
xs_import.attrib['schemaLocation'] = location
schema_tree.append(xs_import)
# Contstruct the schema
schema = etree.XMLSchema(schema_tree)
# Validate!
schema.assertValid(tree)
BTW, your simpletest.xsd is missing the targetNamespace.
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.org" elementFormDefault="qualified">
<xs:element name="name" type="xs:string"/>
</xs:schema>
With the code above, your example document validates against this schema.