Using Python and lxml to validate XML against an e

2019-02-20 02:52发布

问题:

I'm trying to validate an XML file against an external DTD referenced in the doctype tag. Specifically:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">
...the rest of the document...

I'm using Python 3.3 and the lxml module. From reading http://lxml.de/validation.html#validation-at-parse-time, I've thrown this together:

enexFile = open(sys.argv[2], mode="rb") # sys.argv[2] is the path to an XML file in local storage.
enexParser = etree.XMLParser(dtd_validation=True)
enexTree = etree.parse(enexFile, enexParser)

From what I understand of validation.html, the lxml library should now take care of retrieving the DTD and performing validation. But instead, I get this:

$ ./mapwrangler.py validate notes.enex
Traceback (most recent call last):
  File "./mapwrangler.py", line 27, in <module>
    enexTree = etree.parse(enexFile, enexParser)
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
  File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
  File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
  File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
  File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92461)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91757)
lxml.etree.XMLSyntaxError: Validation failed: no DTD found !, line 3, column 43

This surprises me, because if I turn off validation, then the document parses in just fine and I can do print(enexTree.docinfo.doctype) to get

$ ./mapwrangler.py validate notes.enex
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">

So it looks to me like there shouldn't be any problem finding the DTD.

Thanks for your help.

回答1:

You need to add no_network=False when constructing the parser object. This option is set to True by default.

From the documentation of parser options at http://lxml.de/parsing.html#parsers:

no_network - prevent network access when looking up external documents (on by default)



回答2:

For a reason I still don't know, my problem was related to where the XML catalog was located on my local file system.

In my case, I use an XML editor that has a tight integration with a component content management system (CCMS, in this case SDL Trisoft 2011 R2). When the editor connects to the CCMS, DTDs, catalog files and a bunch of other files are synced. These files end up on the local file system in:

C:\Users\[username]\AppData\Local\Trisoft\InfoShare Client\[id]\Config\DocTypes\catalog.xml

I could not get that to work. Simply COPYING the whole catalog to another location fixed things, and this works:

f = r"path/to/my/file.xml"
# set XML catatog file path
os.environ['XML_CATALOG_FILES'] = r'C:\DATA\Mydoctypes\catalog.xml'
# configure parser
parser = etree.XMLParser(dtd_validation=True, no_network=True)
# validate
try:
   valid = etree.parse(f, parser=parser)
    print("This file is valid against the DTD.")
except etree.XMLSyntaxError, error:
   print("This file is INVALID against the DTD!")
   print(error)

Obviously this is not ideal, but it works.

Could it be something to do with file permissions, or perhaps that good old "file path too long" problem in Windows? I have not tried whether a symbolic link would work.

I am using Windows 7, Python 2.7.11 and the version of lxml is (3.6.0).