How to search and replace text in an XML file usin

Posted 2020-06-18 10:51

Question:

How do I search an entire XML file for a specific text pattern and then replace each occurrence of that text with a new text pattern in Python 3.5?

Everything else (format, attributes, comments, etc.) needs to remain as it is in the original xml file.

I am running Python 3.5.1 on Windows (win32).

Specifically, I would like to replace each occurrence of "FEATURE NAME" with "THIS WORKED" and replace each occurrence of "FEATURE NUMBER" with "12345".

I have been trying to learn Python and xml.etree.ElementTree but cannot figure this out. I already looked at "Search and replace a line in a .xml file in Python", "Search and replace a line in a file in Python", "How to search and replace text in a file using Python?", and other existing Q&As on this site, without success. I'm not an experienced programmer, so please let me know if more input is needed. Your help is greatly appreciated!

Here is a copy of what the xml code looks like when I open it in Notepad (except I added spaces to indent each line and hit return for some lines when I pasted it into this question):

<description-topic>
    <access-info>
        <index-term-set>
            <index-term>
                <primary>FID FEATURE NUMBER</primary>
            </index-term>
            <index-term>
                <primary>FEATURE NAME</primary>
            </index-term>
            <index-term>
                <primary>Common features</primary>
                <secondary>FID FEATURE NUMBER</secondary>
            </index-term>
        </index-term-set>
    </access-info>
    <title>FEATURE NUMBER - FEATURE NAME</title>
    <block>
        <label>Platform</label>
        <comment>REVIEWERS: I guessed at the FEATURE NAME</comment>
        <para>
            This feature applies to the following platforms: FEATURE NAME<!--Check the values--></para>
    </block>
    <block branch="no">
        <label>Feature Benefits</label>
        <para>
            <comment>REVIEWERS: What do we put here? See template (link given in review email) for more information.</comment>
        </para>
    </block>
    <block branch="no">
        <label>Dependencies</label>
        <para/>
        <subblock>
            <label>Features</label>
            <comment>What FEATURE NAME do we put here?</comment>
        </subblock>
        <subblock>
            <label>Hardware</label>
            <comment>What FEATURE NAME do we put here?</comment>
            <para>This feature applies to the following: FEATURE NUMBER and text.</para><?Pub Caret -1?>
        </subblock>
        <subblock>
            <label>Dependencies outside the eNodeB</label>
            <comment>What FEATURE NAME do we put here?</comment>
        </subblock>
    </block>
    <block branch="no">
        <label>Impacts</label>
        <comment>REVIEWERS: What FEATURE NUMBER do we put here?</comment>
        <para>
            <comment/>
        </para>
    </block>
</description-topic>

Here is the latest code I am trying to get to work:

from xml.etree import ElementTree as et
tree = et.parse('Atemplate2.xml')
tree.find('description-topic/access-info/index-term-set/index-term/primary/').text = '12345'
tree.write('Atemplate2.xml')

I get the following error:

Traceback (most recent call last):
  File "ajktest18.py", line 15, in <module>
    tree.find('description-topic/access-info/index-term-set/index-term/primary/').text = '12345'
AttributeError: 'NoneType' object has no attribute 'text'
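
A note on this error: find() returns None when nothing matches, and the path in the attempt starts with the root tag and ends with a slash. ElementTree paths are relative to the element being searched, so the root's own tag should not appear in them. The sketch below uses a trimmed-down, hypothetical copy of the sample XML held in a string (SAMPLE is an illustrative name, not part of the original file) to show the difference.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the file in the question.
SAMPLE = """<description-topic>
    <access-info>
        <index-term-set>
            <index-term>
                <primary>FID FEATURE NUMBER</primary>
            </index-term>
        </index-term-set>
    </access-info>
    <title>FEATURE NUMBER - FEATURE NAME</title>
</description-topic>"""

root = ET.fromstring(SAMPLE)

# Starting the path with the root tag matches nothing, so find() returns None.
missing = root.find('description-topic/access-info/index-term-set/index-term/primary/')

# Paths are relative to the element being searched, here the root itself.
primary = root.find('access-info/index-term-set/index-term/primary')
primary.text = primary.text.replace('FEATURE NUMBER', '12345')

# './/' searches at any depth, which avoids spelling out the full path.
titles = root.findall('.//title')
```

This only reaches one element at a time; replacing every occurrence in the file still needs a loop over all elements.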

I would prefer to be able to search and modify any occurrences in the entire file, but I can't figure out how to get to even one specific occurrence of the text I am searching for.

Here is the code I tried to use to find the path:

import xml.etree.ElementTree as ET
tree = ET.parse('Atemplate.xml')
root = tree.getroot()

print(root.tag, root.attrib, root.text)

for child in root:
    print(child.tag, child.attrib, child.text)
for label in root.iter('label'):
    print(label.tag, label.attrib, label.text)
for title in root.iter('title'):
    print(title.attrib)

I also tried the following code:

with open('Atemplate2.xml') as f:
    tree = ET.parse(f)
    root = tree.getroot()

for elem in root.getiterator():
    try:
        elem.text = elem.text.replace('FEATURE NAME', 'THIS WORKED')
        elem.text = elem.text.replace('FEATURE NUMBER', '12345')
    except AttributeError:
        pass

tree.write('output.xml')

but that gives the following error:

File "<pyshell#40>", line 2, in <module>
    tree = ET.parse(f)
File "C:\MyPath\Python35-32\lib\xml\etree\ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
File "C:\MyPath\Python35-32\lib\xml\etree\ElementTree.py", line 594, in parse
    self._root = parser._parse_whole(source)
File "C:\MyPath\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1119: character maps to <undefined>
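
The traceback shows the file being decoded as cp1252, the default for text-mode open() on this Windows setup, and byte 0x9d has no mapping in that code page. One way around this, sketched below, is to hand the parser bytes (a filename or a binary file object) so that it reads the encoding from the XML declaration itself. The sample document is hypothetical and assumes the real file is UTF-8, where a right curly quote is stored as the bytes E2 80 9D.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: a UTF-8 document containing curly quotes (U+201C, U+201D).
raw = ('<?xml version="1.0" encoding="UTF-8"?>'
       '<para>\u201cFEATURE NAME\u201d</para>').encode('utf-8')

# Decoding these bytes as cp1252 fails because 0x9d is undefined in that code page.
try:
    raw.decode('cp1252')
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Given bytes, the parser takes the encoding from the XML declaration.
tree = ET.parse(io.BytesIO(raw))
root = tree.getroot()
root.text = root.text.replace('FEATURE NAME', 'THIS WORKED')
```

With a real file, passing the filename directly, as in ET.parse('Atemplate2.xml'), has the same effect because ElementTree then opens the file in binary mode.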


FINAL UPDATE - Here is the code that worked for me in the end (thank you, Jarad!):

import lxml.etree as ET
#using lxml instead of xml preserved the comments

#adding the encoding when the file is opened and written is needed to avoid a charmap error
with open('filename.xml', encoding="utf8") as f:
  tree = ET.parse(f)
  root = tree.getroot()


  for elem in root.getiterator():
    try:
      elem.text = elem.text.replace('FEATURE NAME', 'THIS WORKED')
      elem.text = elem.text.replace('FEATURE NUMBER', '123456')
    except AttributeError:
      pass

#tree.write('output.xml', encoding="utf8")
# Adding the xml_declaration and method helped keep the header info at the top of the file.
tree.write('output.xml', xml_declaration=True, method='xml', encoding="utf8")
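
For readers who cannot install lxml, a possible standard-library alternative for keeping comments is sketched below. It relies on the insert_comments option of TreeBuilder, which was added in Python 3.8, so it would not run on the Python 3.5.1 used in this question. The one-line SAMPLE document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element sample with a trailing comment, modelled on the question's file.
SAMPLE = '<para>Platforms: FEATURE NAME<!--Check the values--></para>'

# insert_comments=True (Python 3.8+) keeps comment nodes in the parsed tree
# instead of silently dropping them.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(SAMPLE, parser=parser)

root.text = root.text.replace('FEATURE NAME', 'THIS WORKED')
result = ET.tostring(root, encoding='unicode')
```

TreeBuilder also accepts insert_pis=True to keep processing instructions that appear inside the root element, such as the Pub Caret one in the sample file.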

Answer 1:

Caveats:

  • I have never worked with the xml.etree.ElementTree library
  • I have never worked with it because I never find myself manipulating XML
  • I don't know if this is the "best" way compared to someone that knows the library in and out
  • Commenters seem set on judging you instead of helping you out

This is a modification of this excellent answer. The thing is, you need to read the XML file in and parse it.

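import xml.etree.ElementTree as ET

# Open with an explicit encoding so the platform default (cp1252 on Windows) is not used.
with open('xmlfile.xml', encoding='latin-1') as f:
    tree = ET.parse(f)
    root = tree.getroot()

# iter() walks every element; the older getiterator() was removed in Python 3.9.
for elem in root.iter():
    try:
        elem.text = elem.text.replace('FEATURE NAME', 'THIS WORKED')
        elem.text = elem.text.replace('FEATURE NUMBER', '123456')
    except AttributeError:
        # elem.text is None for elements that have no text content
        pass

tree.write('output.xml', encoding='latin-1')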

Note that you can change the encoding parameter to something else, such as utf-8, cp1252, or ISO-8859-1. The right value depends on how the file was actually saved.
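
One gap in a loop that only rewrites elem.text: ElementTree stores text that follows a child's closing tag in that child's tail attribute, so such text is left unchanged. A minimal sketch that covers both attributes follows. REPLACEMENTS, replace_all, and SAMPLE are hypothetical names and data introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping of search text to replacement text.
REPLACEMENTS = {'FEATURE NAME': 'THIS WORKED', 'FEATURE NUMBER': '12345'}

def replace_all(value):
    """Apply every replacement to a string; pass None through unchanged."""
    if value is None:
        return None
    for old, new in REPLACEMENTS.items():
        value = value.replace(old, new)
    return value

# " for FEATURE NUMBER details." follows </label>, so it is the label's tail.
SAMPLE = '<para>See <label>FEATURE NAME</label> for FEATURE NUMBER details.</para>'
root = ET.fromstring(SAMPLE)

for elem in root.iter():
    elem.text = replace_all(elem.text)
    elem.tail = replace_all(elem.tail)

result = ET.tostring(root, encoding='unicode')
```

Checking for None explicitly also removes the need for the try/except around AttributeError.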