Still learning lxml. I discovered that sometimes I cannot get the text of an item from a tree using item.text, but if I use item.text_content() I am good to go. I am not sure I see why yet. Any hints would be appreciated.
Okay, I am not sure exactly how to provide an example without making you handle a file.
Here is some code I wrote to try to figure out why I was not getting some text I expected:
    from lxml import html

    # notmatched is defined elsewhere; notmatched[0] is the path of one file
    with open(notmatched[0]) as f:
        theTree = html.fromstring(f.read())

    text = []
    text_content = []
    notText = []
    hasText = []
    for each in theTree.iter():
        if each.text:
            text.append(each.text)
            hasText.append(each)  # elements for which each.text is truthy
        else:
            notText.append(each)  # elements whose .text is None or empty
        text_content.append(each.text_content())  # the text for all elements
So after I run this I look at:

    >>> len(notText)
    3612
    >>> notText[40]
    <Element b at 26ab650>
    >>> notText[40].text_content()
    '(I.R.S. Employer'
    >>> notText[40].text
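According to the docs, the text_content method returns the text content of the element, including the text content of its children, with no markup.
So for example, suppose doc is the Element a, which contains b and c tags. The a tag has no text immediately following it (between the <a> and the <b>), so doc.text is None. But there is text after the c tag, so doc.text_content() is not None.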
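Here is a minimal sketch of that situation. The markup is my own made-up fragment, not taken from the file in the question; it only assumes that a contains b, and that b contains an empty c followed by some text.

```python
from lxml import html

# Hypothetical markup: <a> holds <b>, which holds an empty <c> followed by text.
doc = html.fromstring('<a><b><c></c>some text</b></a>')

print(doc.tag)             # the fragment's root element is <a>
print(doc.text)            # None: nothing sits between <a> and <b>
print(doc.text_content())  # text gathered from doc and all of its descendants
```

The text belongs to the tail of the c element, so it never shows up in doc.text, but text_content() walks the whole subtree and picks it up.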
PS. There is a clear description of the meaning of the text attribute here. Although it is part of the docs for lxml.etree.Element, I think the meaning of the text and tail attributes applies equally well to lxml.html.Element objects.

You may be confusing different and incompatible interfaces that lxml implements: the lxml.etree items have a .text attribute, while (for example) those from lxml.html implement the text_content method (and those from BeautifulSoup, also included in lxml, have a .string attribute... sometimes [[only nodes with a single child which is a string...]]).
Yeah, it is inherently confusing that lxml chooses both to implement its own interfaces and emulate or include other libraries, but it can be convenient... ;-).
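For what it's worth, a quick sketch suggests that elements parsed with lxml.html expose .text and .tail alongside text_content(), in line with the PS above. The fragment below is a made-up example, not something from the question.

```python
from lxml import html

# Hypothetical fragment: text before, inside, and after a child element.
p = html.fromstring('<p>Hello <b>bold</b> world</p>')
b = p[0]  # the first (and only) child of <p>

print(repr(p.text))            # text between <p> and <b>
print(repr(b.text))            # text inside <b>
print(repr(b.tail))            # text after </b>, up to the next tag
print(repr(p.text_content()))  # all text under <p>, with the markup removed
```

So .text only covers the text up to the first child, while text_content() stitches together the text and tail pieces of the whole subtree.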