I have an XML document that I am trying to parse using lxml.etree:
<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
</Body>
</Envelope>
My code is:
path = "path to xml file"
from lxml import etree as ET
parser = ET.XMLParser(ns_clean=True)
dom = ET.parse(path, parser)
dom.getroot()
When I try to get dom.getroot() I get:
<Element {http://www.example.com/zzz/yyy}Envelope at 28adacac>
However I only want:
<Element Envelope at 28adacac>
When I do
dom.getroot().find("Body")
I get nothing returned. However, when I do
dom.getroot().find("{http://www.example.com/zzz/yyy}Body")
I get a result.
I thought passing ns_clean=True to the parser would prevent this.
Any ideas?
Try using XPath:
Taken (and simplified) from this page, under "The xpath() method" section
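A minimal sketch of the XPath approach, assuming the question's document is inlined as a byte string. The prefix name "e" is an arbitrary choice for this example: XPath has no default namespace, so the URI has to be bound to some prefix.

```python
from lxml import etree as ET

# The question's document, inlined so the sketch is self-contained.
XML = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
</Body>
</Envelope>"""

root = ET.fromstring(XML)

# Map an arbitrary prefix to the document's default namespace URI.
NS = {"e": "http://www.example.com/zzz/yyy"}

# xpath() always returns a list for node-set expressions.
bodies = root.xpath("//e:Body", namespaces=NS)
print(bodies[0].tag)           # {http://www.example.com/zzz/yyy}Body
print(bodies[0].text.strip())  # some stuff

# find() accepts the same prefix mapping and returns one element or None.
body = root.find("e:Body", namespaces=NS)
```

The element is still namespaced after the lookup; the mapping only saves you from spelling out the {uri}tag form in every path.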
You're showing the result of the repr() call. When you programmatically move through the tree, you can simply choose to ignore the namespace.
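One way to ignore the namespace while walking the tree is to compare local names only. This is a sketch, not the answerer's original code; the helper local_name is a hypothetical name introduced for this example, and it relies on lxml's etree.QName.

```python
from lxml import etree as ET

XML = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
</Body>
</Envelope>"""

root = ET.fromstring(XML)


def local_name(element):
    """Return the element's tag without its {namespace} part."""
    return ET.QName(element).localname


print(root.tag)          # {http://www.example.com/zzz/yyy}Envelope
print(local_name(root))  # Envelope

# Select a child by local name only, whatever namespace it lives in.
# This assumes the children are all elements (no comments or PIs).
body = next(child for child in root if local_name(child) == "Body")
print(body.text.strip())  # some stuff
```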
You can find namespace-aware nodes using the xpath method:
If you really want to remove namespaces, you could use an XSL transformation:
Here we see the namespace has been removed:
So you can now find the Body node this way:
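A sketch of the XSL transformation described above, assuming the question's document is inlined as a byte string. The stylesheet is a common namespace-stripping pattern that recreates every element and attribute under its local name; it is an assumption here, not necessarily the exact stylesheet the answer used.

```python
from lxml import etree as ET

XML = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
</Body>
</Envelope>"""

# Stylesheet that rebuilds the tree using local names only.
STRIP_NS_XSLT = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no"/>

  <!-- Copy the document node, comments and processing instructions as-is. -->
  <xsl:template match="/|comment()|processing-instruction()">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- Recreate each element without its namespace. -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Recreate each attribute without its namespace. -->
  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>"""

root = ET.fromstring(XML)
transform = ET.XSLT(ET.fromstring(STRIP_NS_XSLT))

# Applying the transform returns a new tree; the input is left untouched.
result = transform(root)
new_root = result.getroot()

print(new_root.tag)  # Envelope

# Plain, unqualified paths now match.
body = new_root.find("Body")
print(body.text.strip())  # some stuff
```

Note that this discards namespace information for good, so elements that share a local name but live in different namespaces become indistinguishable in the result.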
The last solution from https://bitbucket.org/olauzanne/pyquery/issue/17 can help you to avoid namespaces with little effort
In your case, try xml.replace(' xmlns="', ' xmlnamespace="'). However, you might need something more complex if the string is expected in the bodies as well.
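A sketch of the string-replacement trick, assuming the document is already in memory as a str named xml (the variable name follows the snippet above). Renaming the declaration turns it into an ordinary attribute, so the parser never sees a default namespace.

```python
from lxml import etree as ET

xml = """<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
</Body>
</Envelope>"""

# Rename the default-namespace declaration before parsing.
# This is plain text substitution, so it also rewrites any matching
# text that happens to appear inside element content.
patched = xml.replace(' xmlns="', ' xmlnamespace="')

root = ET.fromstring(patched)
print(root.tag)  # Envelope

# The URI survives as a regular attribute value.
print(root.get("xmlnamespace"))  # http://www.example.com/zzz/yyy

body = root.find("Body")
print(body.text.strip())  # some stuff
```

If the declaration only appears on the root element, passing a count of 1 as the third argument to str.replace limits the substitution to the first occurrence, which reduces the risk of touching body text.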