lxml etree xmlparser remove unwanted namespace

I have an XML document that I am trying to parse using lxml.etree:

<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>

My code is:

path = "path to xml file"
from lxml import etree as ET
parser = ET.XMLParser(ns_clean=True)
dom = ET.parse(path, parser)
dom.getroot()

When I try to get dom.getroot() I get:

<Element {http://www.example.com/zzz/yyy}Envelope at 28adacac>

However I only want:

<Element Envelope at 28adacac>

When I do

dom.getroot().find("Body")

I get nothing returned. However, when I do

dom.getroot().find("{http://www.example.com/zzz/yyy}Body")

I get a result.

I thought passing ns_clean=True to the parser would prevent this.

Any ideas?

Tags: python lxml xml-parsing elementtree

4 Answers

你好瞎i

Reply #2 · 2019-01-06 11:18

Try using XPath:

dom.xpath("//*[local-name() = 'Body']")

Taken (and simplified) from this page, under the "The xpath() method" section.
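
A self-contained sketch of the same idea, assuming the document from the question is available as an inline byte string (the names XML, matches and body_text are illustrative only):

```python
from lxml import etree

XML = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>"""

root = etree.fromstring(XML)

# local-name() compares only the part of the tag after the namespace,
# so no prefix mapping has to be supplied.
matches = root.xpath("//*[local-name() = 'Body']")
body_text = matches[0].text.strip()
print(len(matches), body_text)
```

Note that the matched element still carries its namespace in .tag; only the lookup ignores it.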


Luminary・发光体

Reply #3 · 2019-01-06 11:28

You're showing the result of the repr() call. When you programmatically move through the tree, you can simply choose to ignore the namespace.
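
One way to do that, as a rough sketch: lxml's etree.QName can split a qualified tag into its namespace and local name, so comparisons can be made on the local name alone. The helper find_local below is hypothetical, not part of lxml.

```python
from lxml import etree

XML = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>"""

root = etree.fromstring(XML)

# QName splits the qualified tag shown by repr() into its two parts.
qname = etree.QName(root)
namespace, local = qname.namespace, qname.localname


def find_local(parent, name):
    """Hypothetical helper: first child whose local name equals `name`."""
    for child in parent:
        # Comments and processing instructions have a non-string tag.
        if isinstance(child.tag, str) and etree.QName(child).localname == name:
            return child
    return None


body = find_local(root, "Body")
child_names = [etree.QName(child).localname for child in root]
print(local, child_names, body is not None)
```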


何必那么认真

Reply #4 · 2019-01-06 11:31

import io
import lxml.etree as ET

# A bytes literal, so that io.BytesIO accepts it on Python 3.
content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
dom = ET.parse(io.BytesIO(content))

You can find namespace-aware nodes using the xpath method:

body = dom.xpath('//ns:Body', namespaces={'ns': 'http://www.example.com/zzz/yyy'})
print(body)
# [<Element {http://www.example.com/zzz/yyy}Body at 90b2d4c>]
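
The same prefix mapping also works with find() and findtext(), which may be closer to what the question was attempting. A minimal sketch, assuming the sample document as an inline byte string (NS and XML are illustrative names):

```python
from lxml import etree

XML = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>"""

# The prefix "ns" is arbitrary; only the URI has to match the document.
NS = {"ns": "http://www.example.com/zzz/yyy"}

root = etree.fromstring(XML)
body = root.find("ns:Body", namespaces=NS)
version = root.findtext("ns:Header/ns:Version", namespaces=NS)
print(body.text.strip(), version)
```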

If you really want to remove namespaces, you could use an XSL transformation:

# http://wiki.tei-c.org/index.php/Remove-Namespaces.xsl
xslt = b'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no"/>

<xsl:template match="/|comment()|processing-instruction()">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

<xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
</xsl:template>

<xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="."/>
    </xsl:attribute>
</xsl:template>
</xsl:stylesheet>
'''

# Compile the stylesheet and apply it to the parsed document.
xslt_doc = ET.parse(io.BytesIO(xslt))
transform = ET.XSLT(xslt_doc)
dom = transform(dom)

Here we see the namespace has been removed:

print(ET.tostring(dom, encoding="unicode"))
# <Envelope>
#   <Header>
#     <Version>1</Version>
#   </Header>
#   <Body>
#     some stuff
#   </Body>
# </Envelope>

So you can now find the Body node this way:

print(dom.find("Body"))
# <Element Body at 8506cd4>
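
If an XSL transformation feels heavy, a different technique is to rewrite the tags in place from Python. The sketch below is an alternative to the stylesheet above, not a restatement of it; strip_namespaces is a hypothetical helper, and it only handles element names (namespaced attributes are left untouched).

```python
from lxml import etree

XML = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>"""


def strip_namespaces(root):
    """Hypothetical helper: replace every element tag by its local name, in place."""
    for el in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(el.tag, str):
            continue
        el.tag = etree.QName(el).localname
    # Remove xmlns declarations that are no longer used by any element.
    etree.cleanup_namespaces(root)
    return root


root = strip_namespaces(etree.fromstring(XML))
body = root.find("Body")
print(root.tag, body.text.strip())
```

Two elements that differ only by namespace would collide after this step, so it is best suited to documents with a single namespace such as the one in the question.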


Fickle 薄情

Reply #5 · 2019-01-06 11:37

The last solution from https://bitbucket.org/olauzanne/pyquery/issue/17 can help you avoid namespaces with little effort:

apply xml.replace(' xmlns:', ' xmlnamespace:') to your xml before using pyquery so lxml will ignore namespaces

In your case, try xml.replace(' xmlns="', ' xmlnamespace="'). However, you might need something more robust if that string can also appear in the element content.
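
A rough sketch of that workaround applied to the document from the question, assuming it has been read into a str first (the names xml and renamed are illustrative):

```python
from lxml import etree

xml = """<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>"""

# Renaming the declaration turns it into an ordinary attribute,
# so the parser no longer puts the elements in a namespace.
renamed = xml.replace(' xmlns="', ' xmlnamespace="')

root = etree.fromstring(renamed)
body = root.find("Body")
print(root.tag, body.text.strip(), root.get("xmlnamespace"))
```

The URI survives as a plain xmlnamespace attribute on the root element, which is harmless for lookups but will show up if the tree is serialized again.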
