When I use contains on the text() of an element, it works for plain data but not when the element content includes carriage returns, newlines, or nested tags. How can I make //td[contains(text(), "")]
work in this case? Thank you!
XML :
<table>
<tr>
<td>
Hello world <i> how are you? </i>
Have a wonderful day.
Good bye!
</td>
</tr>
<tr>
<td>
Hello NJ <i>, how are you?
Have a wonderful day.</i>
</td>
</tr>
</table>
Python :
>>> import lxml.html as lh
>>> tdout=open('tdmultiplelines.htm', 'r')
>>> tdouthtml=lh.parse(tdout)
>>> tdout.close()
>>> tdouthtml
<lxml.etree._ElementTree object at 0x2aaae0024368>
>>> tdouthtml.xpath('//td/text()')
['\n Hello world ', '\n Have a wonderful day.\n Good bye!\n ', '\n Hello NJ ', '\n ']
>>> tdouthtml.xpath('//td[contains(text(),"Good bye")]')
[]   # <- but "Good bye" is in the td contents, though as one item of a list.
>>> tdouthtml.xpath('//td[text() = "\n Hello world "]')
[<Element td at 0x2aaae005c410>]
Use:
//td[text()[contains(.,'Good bye')]]
Explanation:
The problem is not that a text node's string value spans several lines. The real reason is that the td element has more than one text-node child.
In the provided expression:
//td[contains(text(),"Good bye")]
the first argument passed to the function contains() is a node-set containing more than one text node.
Per the XPath 1.0 specification, when a function that expects a string argument is passed a node-set instead, it uses the string value of only the first node in the node-set (in document order). In XPath 2.0 this simply raises a type error.
In this specific case, the first text node of the passed node-set has string value:
"
Hello world "
so the comparison fails and the wanted td element isn't selected.
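To see the first-node rule from Python, here is a minimal sketch using lxml (XPath 1.0). It assumes the XML from the question is inlined as a string rather than read from tdmultiplelines.htm; the variable names are only illustrative.

```python
from lxml import etree

# Same document as in the question, inlined so the sketch is self-contained.
XML = """<table>
<tr>
<td>
Hello world <i> how are you? </i>
Have a wonderful day.
Good bye!
</td>
</tr>
<tr>
<td>
Hello NJ <i>, how are you?
Have a wonderful day.</i>
</td>
</tr>
</table>"""

root = etree.fromstring(XML)

# contains() converts the node-set text() to a string using only its first node,
# and "Good bye" is not in the first text node of either td.
first_only = root.xpath('//td[contains(text(), "Good bye")]')

# "Hello world" is in the first text node, so the same form does match here.
first_hit = root.xpath('//td[contains(text(), "Hello world")]')

# Testing each text node on its own finds the td.
each_node = root.xpath('//td[text()[contains(., "Good bye")]]')

print(len(first_only), len(first_hit), len(each_node))  # prints: 0 1 1
```

The only difference between the first and third expressions is where contains() is applied: once to the whole node-set, or once per text node inside a nested predicate.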
XSLT-based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select="//td[text()[contains(.,'Good bye')]]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the provided XML document:
<table>
<tr>
<td>
Hello world <i> how are you? </i>
Have a wonderful day.
Good bye!
</td>
</tr>
<tr>
<td>
Hello NJ <i>, how are you?
Have a wonderful day.</i>
</td>
</tr>
</table>
the XPath expression is evaluated and the selected nodes (in this case just one) are copied to the output:
<td>
Hello world <i> how are you? </i>
Have a wonderful day.
Good bye!
</td>
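Use . instead of text(). The string value of . is the concatenation of all text nodes under the td, in document order, so the test no longer depends on which text node holds the match: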
tdouthtml.xpath('//td[contains(.,"Good bye")]')
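One thing to keep in mind with . is that it also sees text inside child elements such as <i>. A small lxml sketch of the difference, again assuming the question's XML is inlined as a string (the variable names are only illustrative):

```python
from lxml import etree

# Same document as in the question, inlined so the sketch is self-contained.
XML = """<table>
<tr>
<td>
Hello world <i> how are you? </i>
Have a wonderful day.
Good bye!
</td>
</tr>
<tr>
<td>
Hello NJ <i>, how are you?
Have a wonderful day.</i>
</td>
</tr>
</table>"""

root = etree.fromstring(XML)

# "." is the string value of the td: all descendant text joined together.
dot_good_bye = root.xpath('//td[contains(., "Good bye")]')

# "how are you" sits inside <i>, so it is part of "." for both td elements ...
dot_nested = root.xpath('//td[contains(., "how are you")]')

# ... but it is not in any text node that is a direct child of a td.
direct_nested = root.xpath('//td[text()[contains(., "how are you")]]')

print(len(dot_good_bye), len(dot_nested), len(direct_nested))  # prints: 1 2 0
```

So contains(., ...) is the simpler choice when nested markup should count as part of the cell, and the text()[contains(., ...)] form is the one to use when only the td's own text nodes should be searched.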