lxml unicode entity parse problems

Posted 2019-02-26 07:58

Question:

I'm using lxml as follows to parse an exported XML file from another system:

from lxml import etree

xmldoc = open(filename)
etree.parse(xmldoc)

But I'm getting:

lxml.etree.XMLSyntaxError: Entity 'eacute' not defined, line 4495, column 46

Obviously it's having problems with Unicode entity names - but how would I get round this? Via open() or parse()?

Edit: I had forgotten to include my DTD in the same folder - it's there now and has the following declaration:

<!ENTITY eacute "&#233;">

and is referred to (and always was) in xmldoc like so:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE DScribeDatabase SYSTEM "foo.dtd">

Yet I still get the same problem ... does the DTD need to be declared in Python too?

Answer 1:

eacute is not a predefined entity in XML. To include an &eacute; entity reference in an XML file, it must have a <!DOCTYPE> declaration pointing to a DTD (such as an XHTML 1.0 DTD) that defines the entity.

If the XML uses &eacute; but doesn't have a <!DOCTYPE>, it is not well-formed and the system that exported it needs to be fixed.
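When the <!DOCTYPE> and the DTD are both in place, as in the edited question, a likely remaining cause is that lxml's default parser does not load an external DTD, so the entity declaration is never read. The sketch below is one way to enable this with etree.XMLParser(load_dtd=True). The file name export.xml, the helper parse_with_dtd, and the sample contents are assumptions made up for illustration; only foo.dtd and the entity declaration come from the question.

```python
import os
import tempfile

from lxml import etree

# Sample files modelled on the question; the contents are illustrative only.
DTD_TEXT = (
    '<!ELEMENT DScribeDatabase (#PCDATA)>\n'
    '<!ENTITY eacute "&#233;">\n'
)
XML_TEXT = (
    '<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    '<!DOCTYPE DScribeDatabase SYSTEM "foo.dtd">\n'
    '<DScribeDatabase>caf&eacute;</DScribeDatabase>\n'
)


def parse_with_dtd(path):
    """Parse an XML file with a parser that loads the DTD named in its DOCTYPE."""
    parser = etree.XMLParser(load_dtd=True)
    # Passing the path (not an anonymous stream) lets the relative
    # system identifier "foo.dtd" be resolved next to the document.
    return etree.parse(path, parser)


with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "foo.dtd"), "w", encoding="ascii") as f:
        f.write(DTD_TEXT)
    xml_path = os.path.join(folder, "export.xml")  # hypothetical file name
    with open(xml_path, "w", encoding="iso-8859-1") as f:
        f.write(XML_TEXT)

    # The &eacute; reference is replaced by the character declared in the DTD.
    text = parse_with_dtd(xml_path).getroot().text
```

So the fix would go in parse() rather than open(): the parser object is passed as the second argument to etree.parse().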

(There isn't a good reason to use an entity reference to represent é in an XML file. The character reference &#233; is understood everywhere without entity definitions, if the file can't simply include a raw UTF-8 é for some reason.)
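The difference between the two kinds of reference can be checked without any DTD. This minimal sketch uses the standard library's xml.etree.ElementTree rather than lxml, on the assumption that any conforming XML parser treats character references the same way; the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

with_char_ref = "<word>caf&#233;</word>"      # numeric character reference
with_entity_ref = "<word>caf&eacute;</word>"  # named entity, no DTD declares it

# The character reference is decoded by the parser itself.
text = ET.fromstring(with_char_ref).text

# The named entity is undefined here, so parsing fails.
try:
    ET.fromstring(with_entity_ref)
    entity_ref_parsed = True
except ET.ParseError:
    entity_ref_parsed = False
```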