How can I read the header of an XML document in Python 3?
Ideally, I would use the defusedxml module as the documentation states that it's safer, but at this point (after hours of trying to figure this out), I'd settle for any parser.
For example, I have a document (this is actually from an exercise) that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"> <!-- this is root -->
<!-- CONTENTS -->
</plist>
I'm wondering how to access everything before the root node.
This seems like such a general question that I thought I would easily find an answer online, but I guess I was wrong. The closest thing I found was this question on Stack Overflow, which didn't really help (I looked into xml.sax, but couldn't find anything relevant).
I tried minidom
which is vulnerable to billion laughs and quadratic blowup attacks according to link you provided. Here is my code:
from xml.dom.minidom import parse
dom = parse('file.xml')
print('<?xml version="{}" encoding="{}"?>'.format(dom.version, dom.encoding))
print(dom.doctype.toxml())
#or
print(dom.getElementsByTagName('plist')[0].previousSibling.toxml())
#or
print(dom.childNodes[0].toxml())
Output:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC '-//Apple Computer//DTD PLIST 1.0//EN' 'http://www.apple.com/DTDs/PropertyList-1.0.dtd'>
<!DOCTYPE plist PUBLIC '-//Apple Computer//DTD PLIST 1.0//EN' 'http://www.apple.com/DTDs/PropertyList-1.0.dtd'>
<!DOCTYPE plist PUBLIC '-//Apple Computer//DTD PLIST 1.0//EN' 'http://www.apple.com/DTDs/PropertyList-1.0.dtd'>
You can use minidom
from defusedxml
. I downloaded that package and just replaced import with from defusedxml.minidom import parse
and code worked with same output.
With the lxml library, you can access document properties via a DocInfo
object.
from lxml import etree
tree = etree.parse('input.xml')
info = tree.docinfo
v, e, d = info.xml_version, info.encoding, info.doctype
print('<?xml version="{}" encoding="{}"?>'.format(v, e))
print(d)
Output:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
Try this code !
I am assuming the temporary xml in variable 's' .
I am declare a class of MyParser having a function of XmlDecl to print the XML header & the purpose of second function is to parse the XML header .so first create the parser by using the ParserCreate() function defined in xml.parsers .
Now create the object of MyParser class 'parser' & call the parse function with the object reference.
from xml.parsers import expat
s = """<?xml version='1.0' encoding='iso-8859-1'?>
<book>
<title>Title</title>
<chapter>Chapter 1</chapter>
</book>"""
class MyParser(object):
def XmlDecl(self, version, encoding, standalone):
print ("XmlDecl", version, encoding, standalone)
def Parse(self, data):
Parser = expat.ParserCreate()
Parser.XmlDeclHandler = self.XmlDecl
Parser.Parse(data, 1)
parser = MyParser()
parser.Parse(s)