可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have a xml file as follows
<Person>
<name>
My Name
</name>
<Address>My Address</Address>
</Person>
The tag has extra new lines, Is there any quick Pythonic way to trim this and generate a new xml.
I found this but it trims only which are between tags not the value
https://skyl.org/log/post/skyl/2010/04/remove-insignificant-whitespace-from-xml-string-with-python/
Update 1 - Handle following xml which has tail spaces in <name>
tag
<Person>
<name>
My Name<shortname>My</short>
</name>
<Address>My Address</Address>
</Person>
Accepted answer handle above both kind of xml's
Update 2 - I have posted my version in answer below, I am using it to remove all kind of whitespaces and generate pretty xml in file with xml encodings
https://stackoverflow.com/a/19396130/973699
回答1:
With lxml
you can iterate over all elements and check if it has text to strip()
:
from lxml import etree
tree = etree.parse('xmlfile')
root = tree.getroot()
for elem in root.iter('*'):
if elem.text is not None:
elem.text = elem.text.strip()
print(etree.tostring(root))
It yields:
<Person><name>My Name</name>
<Address>My Address</Address>
</Person>
UPDATE to strip tail
text too:
from lxml import etree
tree = etree.parse('xmlfile')
root = tree.getroot()
for elem in root.iter('*'):
if elem.text is not None:
elem.text = elem.text.strip()
if elem.tail is not None:
elem.tail = elem.tail.strip()
print(etree.tostring(root, encoding="utf-8", xml_declaration=True))
回答2:
Accepted answer given by Birei using lxml does the job perfectly, but I wanted to trim all kind of white/blank space, blank lines and regenerate pretty xml in a xml file.
Following code did what I wanted
from lxml import etree
#discard strings which are entirely white spaces
myparser = etree.XMLParser(remove_blank_text=True)
root = etree.parse('xmlfile',myparser)
#from Birei's answer
for elem in root.iter('*'):
if elem.text is not None:
elem.text = elem.text.strip()
if elem.tail is not None:
elem.tail = elem.tail.strip()
#write the xml file with pretty print and xml encoding
root.write('xmlfile', pretty_print=True, encoding="utf-8", xml_declaration=True)
回答3:
You have to do xml parsing for this one way or another, so maybe use xml.sax
and copy to the output stream at each event (skipping ignorableWhitespace
), and add tag markers as needed. Check the sample code here http://www.knowthytools.com/2010/03/sax-parsing-with-python.html.
回答4:
You can use beautifulsoup. Do traverse all elements and for each one that contains some text, replace it with its stripped version:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('xmlfile', 'r'), 'xml')
for elem in soup.find_all():
if elem.string is not None:
elem.string = elem.string.strip()
print(soup)
Assuming xmlfile
with the content provided in the question, it yields:
<?xml version="1.0" encoding="utf-8"?>
<Person>
<name>My Name</name>
<Address>My Address</Address>
</Person>
回答5:
I'm working with an older version of Python (2.3), and I'm currently stuck with the standard library. To show an answer that's greatly backwards compatible, I've written this with xml.dom
and xml.minidom
functions.
import codecs
from xml.dom import minidom
# Read in the file to a DOM data structure.
original_document = minidom.parse("original_document.xml")
# Open a UTF-8 encoded file, because it's fairly standard for XML.
stripped_file = codecs.open("stripped_document.xml", "w", encoding="utf8")
# Tell minidom to format the child text nodes without any extra whitespace.
original_document.writexml(stripped_file, indent="", addindent="", newl="")
stripped_file.close()
While it's not BeautifulSoup
, this solution is pretty elegant and uses the full force of the lower-level API. Note that the actual formatting is just one line :)
Documentation of API calls used here:
minidom.parse
minidom.Node.writexml
codecs.open