可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

How can I get all the text content of an XML document, as a single string - like this Ruby/hpricot example but using Python.

I'd like to replace XML tags with a single whitespace.

回答1:

You asked for lxml:

reslist = list(root.iter())
result = ' '.join([element.text for element in reslist])

Or:

result = ''
for element in root.iter():
    result += element.text + ' '
result = result[:-1] # Remove trailing space

回答2:

Using stdlib xml.etree

import xml.etree.ElementTree as ET

tree = ET.parse('sample.xml') 
print(ET.tostring(tree, encoding='utf-8', method='text'))

回答3:

I really like BeautifulSoup, and would rather not use regex on HTML if we can avoid it.

Adapted from: [this StackOverflow Answer], [BeautifulSoup documentation]

from bs4 import BeautifulSoup
soup = BeautifulSoup(txt)    # txt is simply the a string with your XML file
pageText = soup.findAll(text=True)
print ' '.join(pageText)

Though of course, you can (and should) use BeautifulSoup to navigate the page for what you are looking for.

回答4:

A solution that doesn't require an external library like BeautifulSoup, using the built-in sax parsing framework:

from xml import sax

class MyHandler(sax.handler.ContentHandler):
    def parse(self, filename):
        self.text = []
        sax.parse(filename, self)
        return ''.join(self.text)

    def characters(self, data):
        self.text.append(data)

result = MyHandler().parse("yourfile.xml")

If you need all whitespace intact in the text, also define the ignorableWhitespace method in the handler class in the same way characters is defined.

回答5:

This very problem is actually an example in the lxml tutorial, which suggests using one of the following XPath expressions to get all the bits of text content from the document as a list of strings:

root.xpath("string()")
root.xpath("//text()")

You'll then want to join these bits of text together into a single big string, with str.join probably using str.strip to get rid of leading and trailing whitespace on each bit and ignoring bits that are made entirely of whitespace:

>>> from lxml import etree
>>> root = etree.fromstring("""
... <node>
...   some text
...   <inner_node someattr="someval">   </inner_node>
...   <inner_node>
...     foo bar
...   </inner_node>
...   yet more text
...   <inner_node />
...   even more text
... </node>
... """)
>>> bits_of_text = root.xpath('//text()')
>>> print(bits_of_text)  # Note that some bits are whitespace-only
['\n  some text\n  ', '   ', '\n  ', '\n    foo bar\n  ', '\n  yet more text\n  ', '\n  even more text\n']
>>> joined_text = ' '.join(
...     bit.strip() for bit in bits_of_text
...     if bit.strip() != ''
... )
>>> print(joined_text)
some text foo bar yet more text even more text

Note, by the way, that if you don't want to insert spaces between the bits of text you can just do

etree.tostring(root, method='text', encoding='unicode')

And if you're dealing with HTML instead of XML, and are using lxml.html to parse your HTML, you can just call the .text_content() method of your root node to get all the text it contains (although, again, no spaces will be inserted):

>>> import lxml.html
>>> root = lxml.html.document_fromstring('<p>stuff<p>more <br><b>stuff</b>bla')
>>> root.text_content()
'stuffmore stuffbla'

Get all text from an XML document?

问题:

回答1:

回答2:

回答3:

回答4:

回答5:

收藏的人(0)

Get all text from an XML document?

问题:

回答1:

回答2:

回答3:

回答4:

回答5:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮