I have an XML document that I am trying to parse using lxml.etree:
<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
</Body>
</Envelope>
My code is:
path = "path to xml file"
from lxml import etree as ET
parser = ET.XMLParser(ns_clean=True)
dom = ET.parse(path, parser)
dom.getroot()
When I call dom.getroot(), I get:
<Element {http://www.example.com/zzz/yyy}Envelope at 28adacac>
However, I only want:
<Element Envelope at 28adacac>
When I do
dom.getroot().find("Body")
I get nothing returned. However, when I do
dom.getroot().find("{http://www.example.com/zzz/yyy}Body")
I get a result.
I thought passing ns_clean=True to the parser would prevent this.
Any ideas?
import io
import lxml.etree as ET
content='''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
</Body>
</Envelope>
'''
# io.BytesIO needs bytes, so encode the str first (required on Python 3)
dom = ET.parse(io.BytesIO(content.encode('utf-8')))
You can find namespace-aware nodes using the xpath method:
body=dom.xpath('//ns:Body',namespaces={'ns':'http://www.example.com/zzz/yyy'})
print(body)
# [<Element {http://www.example.com/zzz/yyy}Body at 90b2d4c>]
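The same kind of prefix map can also be passed to find(), which may read more naturally when you only need a single child. This is a minimal sketch, assuming the sample document from the question; the prefix name ns and the NS variable are arbitrary choices, and only the URI has to match the document:

```python
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
</Body>
</Envelope>
'''
root = ET.fromstring(content)

# The prefix "ns" is an arbitrary label; it is mapped to the document's URI.
NS = {'ns': 'http://www.example.com/zzz/yyy'}

# Every step of the path needs the prefix, because every element
# inherits the default namespace declared on <Envelope>.
body = root.find('ns:Body', namespaces=NS)
version = root.find('ns:Header/ns:Version', namespaces=NS)

print(body.tag)
print(version.text)
```

The elements you get back still carry the full {uri}tag name; the prefix map only shortens what you have to type in the path.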
If you really want to remove namespaces, you could use an XSL transformation:
# http://wiki.tei-c.org/index.php/Remove-Namespaces.xsl
xslt='''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no"/>
<xsl:template match="/|comment()|processing-instruction()">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<xsl:template match="*">
<xsl:element name="{local-name()}">
<xsl:apply-templates select="@*|node()"/>
</xsl:element>
</xsl:template>
<xsl:template match="@*">
<xsl:attribute name="{local-name()}">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:template>
</xsl:stylesheet>
'''
xslt_doc=ET.parse(io.BytesIO(xslt.encode('utf-8')))
transform=ET.XSLT(xslt_doc)
dom=transform(dom)
Here we see the namespace has been removed:
print(ET.tostring(dom, encoding='unicode'))
# <Envelope>
# <Header>
# <Version>1</Version>
# </Header>
# <Body>
# some stuff
# </Body>
# </Envelope>
So you can now find the Body node this way:
print(dom.find("Body"))
# <Element Body at 8506cd4>
Try using XPath:
dom.xpath("//*[local-name() = 'Body']")
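For reference, here is a self-contained sketch of that local-name() query run against the sample document from the question. The variable names are just for illustration:

```python
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
</Body>
</Envelope>
'''
root = ET.fromstring(content)

# local-name() compares only the part of the tag after the namespace,
# so no prefix map is needed.
matches = root.xpath("//*[local-name() = 'Body']")

print(len(matches))
print(matches[0].text.strip())
```

Note that this matches a Body element in any namespace, which is convenient here but could be too broad for a document that mixes several vocabularies.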
Taken (and simplified) from this page, under "The xpath() method" section
The last solution from https://bitbucket.org/olauzanne/pyquery/issue/17 can help you avoid namespaces with little effort:
apply xml.replace(' xmlns:', ' xmlnamespace:') to your XML before using pyquery, so lxml will ignore namespaces.
In your case, try xml.replace(' xmlns="', ' xmlnamespace="'). However, you might need something more complex if that string can also appear in the element bodies.
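A rough sketch of the string-replacement idea applied to the question's document, using plain lxml rather than pyquery. It assumes the XML is available as a str and that the text ' xmlns="' occurs only in the namespace declaration:

```python
import lxml.etree as ET

xml = '''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
</Body>
</Envelope>
'''

# Rename the default-namespace declaration so the parser sees an
# ordinary attribute instead of a namespace.
stripped = xml.replace(' xmlns="', ' xmlnamespace="')
root = ET.fromstring(stripped)

print(root.tag)
body = root.find('Body')
print(body is not None)
# The URI survives as a plain attribute on the root element.
print(root.get('xmlnamespace'))
```

This is a blunt instrument: it edits the raw text, so it also rewrites any matching string inside element content or attribute values.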
You're showing the result of the repr() call. When you programmatically move through the tree, you can simply choose to ignore the namespace.
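One way to do that, sketched here under the assumption that you only care about the local part of each tag, is lxml's QName class. The local_name helper is hypothetical, not part of lxml:

```python
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
</Body>
</Envelope>
'''
root = ET.fromstring(content)


def local_name(element):
    """Return the tag without its {namespace} part (hypothetical helper)."""
    return ET.QName(element).localname


print(local_name(root))

# Pick children by local name while walking the tree. This sample has no
# comments or processing instructions among the children; those would
# need to be skipped before calling the helper.
bodies = [child for child in root if local_name(child) == 'Body']
print(len(bodies))
```

The tree itself is left untouched, so the elements keep their namespaces when you serialize the document again.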