When parsing HTML, why do I need item.text sometimes and item.text_content() other times?

Posted 2019-06-17 04:58

I am still learning lxml. I discovered that sometimes I cannot get the text of an item in a tree using item.text, but if I use item.text_content() I am good to go. I do not see why yet. Any hints would be appreciated.

I am not sure how to provide an example without making you handle a file.

Here is some code I wrote to try to figure out why I was not getting some text I expected:

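from lxml import html

# notmatched is a list of HTML file paths built earlier (not shown)
with open(notmatched[0]) as f:
    theTree = html.fromstring(f.read())

text = []          # the .text value of every element that has one
text_content = []  # text_content() of every element
notText = []       # elements whose .text is None or empty
hasText = []       # elements whose .text is non-empty
for each in theTree.iter():
    if each.text:
        text.append(each.text)
        hasText.append(each)
    else:
        notText.append(each)
    # text of the element plus the text of all of its descendants
    text_content.append(each.text_content())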

So after I run this I look at

>>> len(notText)
3612
>>> notText[40]
<Element b at 26ab650>
>>> notText[40].text_content()
'(I.R.S. Employer'
>>> notText[40].text

2 answers
ゆ 、 Hurt°
#2 · 2019-06-17 05:46

According to the docs, the text_content method:

Returns the text content of the element, including the text content of its children, with no markup.

So for example,

import lxml.html as lh
data = """<a><b><c>blah</c></b></a>"""
doc = lh.fromstring(data)
print(doc)
# <Element a at b76eb83c>

doc is the Element a. The a tag has no text immediately following it (between the <a> and the <b>), so doc.text is None:

print(doc.text)
# None

But there is text inside the c element, which is a descendant of a, so doc.text_content() is not None:

print(doc.text_content())
# blah
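
The notText[40] case in the question probably works the same way. Here is a minimal sketch, assuming the <b> element wraps its text in another tag; the markup below is hypothetical, since the original file is not shown:

```python
import lxml.html as lh

# Hypothetical markup: the visible text sits one level below <b>, inside <font>.
snippet = "<p><b><font>(I.R.S. Employer</font></b> Identification No.)</p>"
p = lh.fromstring(snippet)
b = p.find("b")
font = b.find("font")

print(b.text)            # None: there is nothing between <b> and <font>
print(font.text)         # (I.R.S. Employer
print(b.text_content())  # (I.R.S. Employer
print(p.text_content())  # (I.R.S. Employer Identification No.)
```

Here b.text is None even though b.text_content() returns the string, because the string belongs to the nested font element.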

PS. The docs for lxml.etree.Element give a clear description of the meaning of the text attribute. Although that description is written for lxml.etree.Element, I think the meaning of the text and tail attributes applies equally well to lxml.html.Element objects.
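
As a rough illustration of text and tail, the sketch below parses the same made-up markup with both lxml.etree and lxml.html and compares the attributes. The markup and variable names are only for illustration:

```python
from lxml import etree, html

markup = "<p>Hello <b>bold</b> world</p>"

results = []
for parse in (etree.fromstring, html.fromstring):
    p = parse(markup)
    b = p[0]
    # p.text: text before the first child
    # b.text: text inside <b>
    # b.tail: text after </b>, up to the next tag
    results.append((p.text, b.text, b.tail))

for row in results:
    print(row)  # ('Hello ', 'bold', ' world') from both parsers
```

Note that ' world' is stored as the tail of b, not as part of p.text, which is why p.text alone misses it.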

ゆ 、 Hurt°
#3 · 2019-06-17 05:53

You may be confusing different and incompatible interfaces that lxml implements: the lxml.etree items have a .text attribute, while (for example) those from lxml.html implement the text_content method (and those from BeautifulSoup, also included in lxml, have a .string attribute... sometimes [[only nodes with a single child which is a string...]]).

Yeah, it is inherently confusing that lxml chooses both to implement its own interfaces and emulate or include other libraries, but it can be convenient...;-).
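
If it helps, here is a small check of which of these two interfaces each kind of element exposes. This is only a sketch, reusing the markup from the other answer, and the variable names are made up. As far as I can tell, lxml.html elements offer both .text and text_content(), while plain lxml.etree elements offer .text only, with itertext() as the closest equivalent:

```python
from lxml import etree, html

markup = "<a><b><c>blah</c></b></a>"
xml_el = etree.fromstring(markup)   # plain lxml.etree element
html_el = html.fromstring(markup)   # lxml.html element

# Plain etree elements do not have text_content()
etree_has_text_content = hasattr(xml_el, "text_content")
print(etree_has_text_content)       # False

# lxml.html elements have both the attribute and the method
print(html_el.text)                 # None
print(html_el.text_content())       # blah

# For plain etree elements, joining itertext() gives the same string
joined = "".join(xml_el.itertext())
print(joined)                       # blah
```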
