Checking if XML declaration is present

Published 2019-08-02 04:09

Question:

I am trying to check whether an xml file contains the necessary xml declaration ("header"), let's say:

<?xml version="1.0" encoding="UTF-8"?>
...rest of xml file...

I am using xml ElementTree for reading and getting info out of the file, but it seems to load a file just fine even if it does not have the header.

What I tried so far is this:

import re
import sys
import xml.etree.ElementTree as ET

tree = ET.parse(someXmlFile)

try:
    # Serialize the parsed tree back to text
    xmlFile = ET.tostring(tree.getroot(), encoding='utf8').decode('utf8')
except Exception:
    sys.stderr.write("Wrong xml2 header\n")
    sys.exit(31)

# Look for the declaration at the start of the serialized text
if re.match(r"^\s*<\?xml version='1\.0' encoding='utf8'\?>\s+", xmlFile) is None:
    sys.stderr.write("Wrong xml1 header\n")
    sys.exit(31)

But the ET.tostring() function just "makes up" a header if it is not present in the file.
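
This is easy to reproduce. A minimal sketch, assuming a made-up in-memory document (the bookstore markup is only an illustration) that has no declaration at all:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a document WITHOUT an XML declaration
source = "<bookstore><book/></bookstore>"

root = ET.fromstring(source)
serialized = ET.tostring(root, encoding="utf8").decode("utf8")

# The serialized text starts with a declaration the input never had
print(serialized.splitlines()[0])
```

So a regex run on the serialized text says nothing about the original file.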

Is there any way to check for an XML header with ET? Or to somehow throw an error while loading the file with ET.parse if the file does not contain the XML header?

Answer 1:

tl;dr

from xml.dom.minidom import parseString

def has_xml_declaration(xml):
    # minidom leaves version as None when the document has no declaration
    return parseString(xml).version is not None
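
As a usage sketch, the check can be tried on two in-memory documents. The sample byte strings below are made up for illustration, and the function is repeated so that the snippet runs on its own:

```python
from xml.dom.minidom import parseString

def has_xml_declaration(xml):
    # minidom leaves Document.version as None when no declaration was parsed
    return parseString(xml).version is not None

# Hypothetical sample documents
with_decl = b'<?xml version="1.0" encoding="UTF-8"?><bookstore/>'
without_decl = b'<bookstore/>'

print(has_xml_declaration(with_decl))     # True
print(has_xml_declaration(without_decl))  # False
```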

From Wikipedia's section on the XML declaration:

If an XML document lacks encoding specification, an XML parser assumes that the encoding is UTF-8 or UTF-16, unless the encoding has already been determined by a higher protocol.

...

The declaration may be optionally omitted because it declares as its encoding the default encoding. However, if the document instead makes use of XML 1.1 or another character encoding, a declaration is necessary. Internet Explorer prior to version 7 enters quirks mode if it encounters an XML declaration in a document served as text/html.

So even if the XML declaration is omitted from an XML document, the code snippet:

if re.match(r"^<\?xml\s*version=\'1\.0\' encoding=\'utf8\'\s*\?>", xmlFile.decode('utf-8')) is None:

will find "the" default XML declaration, because the text it inspects is the output of ET.tostring(), which writes a declaration of its own. Note that I used xmlFile.decode('utf-8') instead of xmlFile. If you don't mind using minidom, you can use the following code snippet:

from xml.dom.minidom import parse

dom = parse('bookstore-003.xml')
print('<?xml version="{}" encoding="{}"?>'.format(dom.version, dom.encoding))

Here is a working fiddle. In bookstore-001.xml an XML declaration is present, in bookstore-002.xml no XML declaration is present, and bookstore-003.xml has a different XML declaration than the first example. The print statement outputs the version and the encoding accordingly:

<?xml version="1.0" encoding="UTF-8"?>

<?xml version="None" encoding="None"?>

<?xml version="1.0" encoding="ISO-8859-1"?>
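
The bookstore files themselves are not included here, so the following is only a sketch that recreates the three cases in memory. The names SAMPLES and describe and the minimal document bodies are assumptions made for illustration; only the declarations mirror the three cases above.

```python
from xml.dom.minidom import parseString

# Hypothetical stand-ins for the three bookstore files
SAMPLES = {
    "bookstore-001.xml": b'<?xml version="1.0" encoding="UTF-8"?><bookstore/>',
    "bookstore-002.xml": b'<bookstore/>',
    "bookstore-003.xml": b'<?xml version="1.0" encoding="ISO-8859-1"?><bookstore/>',
}

def describe(xml_bytes):
    # version and encoding are taken from the declaration, or None if it is absent
    dom = parseString(xml_bytes)
    return '<?xml version="{}" encoding="{}"?>'.format(dom.version, dom.encoding)

for name, data in SAMPLES.items():
    print(name, describe(data))
```

Passing bytes rather than str keeps the declared encoding meaningful, since the parser then decodes the input itself.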