Get all text from an XML document?

Posted 2019-02-19 17:50

Question:

How can I get all the text content of an XML document as a single string, like this Ruby/hpricot example but using Python?

I'd like to replace XML tags with a single whitespace.

Answer 1:

You asked for lxml:

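# element.text is None for elements without text, so skip those.
# Note: this collects .text only, not the .tail text that follows a child element.
result = ' '.join(element.text for element in root.iter() if element.text)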

Or:

parts = []
for element in root.iter():
    if element.text:  # element.text is None for elements without text
        parts.append(element.text)
result = ' '.join(parts)  # No trailing space to remove


Answer 2:

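Using the stdlib xml.etree:

import xml.etree.ElementTree as ET

tree = ET.parse('sample.xml')
# tostring() expects an Element, so pass the root rather than the tree;
# encoding='unicode' returns a str instead of bytes
print(ET.tostring(tree.getroot(), encoding='unicode', method='text'))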
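Note that method='text' concatenates the text pieces with nothing in between, while the question asks for a single space where the tags were. A minimal sketch of one way to get that with the stdlib, using itertext(), which yields both the .text and the .tail strings in document order. The helper name all_text and the sample document are made up for illustration:

```python
import xml.etree.ElementTree as ET

def all_text(root):
    """Join every non-blank text chunk under root with single spaces."""
    # itertext() yields .text and .tail strings in document order.
    chunks = (chunk.strip() for chunk in root.itertext())
    return ' '.join(chunk for chunk in chunks if chunk)

root = ET.fromstring('<a>one<b>two</b>three<c/>four</a>')
print(all_text(root))  # one two three four
```

Whitespace inside a chunk is left alone here; only the chunk edges are stripped.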


Answer 3:

I really like BeautifulSoup, and would rather not use regex on HTML if we can avoid it.

Adapted from: [this StackOverflow Answer], [BeautifulSoup documentation]

from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'xml')    # txt is simply a string with your XML file; the 'xml' parser requires lxml
pageText = soup.find_all(string=True)
print(' '.join(pageText))

Though of course, you can (and should) use BeautifulSoup to navigate the page for what you are looking for.



Answer 4:

A solution that doesn't require an external library like BeautifulSoup, using the built-in sax parsing framework:

from xml import sax

class MyHandler(sax.handler.ContentHandler):
    def parse(self, filename):
        self.text = []
        sax.parse(filename, self)
        return ''.join(self.text)

    def characters(self, data):
        self.text.append(data)

result = MyHandler().parse("yourfile.xml")

If you need all whitespace intact in the text, also define the ignorableWhitespace method in the handler class in the same way characters is defined.
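A rough sketch of what that could look like, parsing from an in-memory byte string with sax.parseString instead of a file. The class name TextHandler and the sample input are placeholders, and whether ignorableWhitespace is ever called depends on the parser in use:

```python
from xml import sax

class TextHandler(sax.ContentHandler):
    """Collects character data, including ignorable whitespace."""

    def __init__(self):
        super().__init__()
        self.text = []

    def characters(self, data):
        self.text.append(data)

    # Only called by parsers that report ignorable whitespace separately
    # (typically validating ones); otherwise it arrives via characters().
    def ignorableWhitespace(self, whitespace):
        self.text.append(whitespace)

handler = TextHandler()
sax.parseString(b'<a>one <b>two</b> three</a>', handler)
result = ''.join(handler.text)
print(result)  # one two three
```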



Answer 5:

This very problem is actually an example in the lxml tutorial, which suggests using one of the following XPath expressions to get the text content of the document:

  • root.xpath("string()"), which returns all the text as one concatenated string
  • root.xpath("//text()"), which returns the bits of text as a list of strings

With the list form, you'll then want to join the bits of text into a single big string with str.join, probably using str.strip to remove leading and trailing whitespace from each bit and skipping bits that consist entirely of whitespace:

>>> from lxml import etree
>>> root = etree.fromstring("""
... <node>
...   some text
...   <inner_node someattr="someval">   </inner_node>
...   <inner_node>
...     foo bar
...   </inner_node>
...   yet more text
...   <inner_node />
...   even more text
... </node>
... """)
>>> bits_of_text = root.xpath('//text()')
>>> print(bits_of_text)  # Note that some bits are whitespace-only
['\n  some text\n  ', '   ', '\n  ', '\n    foo bar\n  ', '\n  yet more text\n  ', '\n  even more text\n']
>>> joined_text = ' '.join(
...     bit.strip() for bit in bits_of_text
...     if bit.strip() != ''
... )
>>> print(joined_text)
some text foo bar yet more text even more text

Note, by the way, that if you don't want to insert spaces between the bits of text you can just do

etree.tostring(root, method='text', encoding='unicode')

And if you're dealing with HTML instead of XML, and are using lxml.html to parse your HTML, you can just call the .text_content() method of your root node to get all the text it contains (although, again, no spaces will be inserted):

>>> import lxml.html
>>> root = lxml.html.document_fromstring('<p>stuff<p>more <br><b>stuff</b>bla')
>>> root.text_content()
'stuffmore stuffbla'


Tags: python xml lxml