Assume we have the following html:
<html>
<body>
<a href="/1234.html">TEXT A</a>
<a href="/3243.html">TEXT B</a>
<a href="/7445.html">TEXT C</a>
</body>
</html>
How do I make it find the element "a", which contains "TEXT A"?
So far I've got:
root = lxml.html.document_fromstring(the_html_above)
e = root.find('.//a')
I've tried:
e = root.find('.//a[@text="TEXT A"]')
but that didn't work, as the "a" tags have no attribute "text".
Is there any way I can solve this in a similar fashion to what I've tried?
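You are very close. Use text()= rather than @text (which indicates an attribute), as in .//a[text()="TEXT A"].
Or, if you know only that the text contains "TEXT A", use contains(text(), "TEXT A") as the predicate.
Or, if you know only that the text starts with "TEXT A", use starts-with(text(), "TEXT A") as the predicate.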
See the docs for more on the available string functions.
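For example, applied to the HTML above, each of these expressions, passed to root.xpath(), yields a list containing a single element: the link whose text is "TEXT A" (href "/1234.html").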
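A minimal sketch of the three queries, assuming the markup from the question is held in a string named the_html_above (with the closing body tag corrected). Note that it calls xpath() rather than find(): as far as I know, find() only understands the limited ElementPath subset and rejects text() predicates, while xpath() accepts full XPath 1.0 and returns a list of matches.

```python
import lxml.html

# Sample markup from the question (closing </body> tag corrected).
the_html_above = """
<html>
<body>
<a href="/1234.html">TEXT A</a>
<a href="/3243.html">TEXT B</a>
<a href="/7445.html">TEXT C</a>
</body>
</html>
"""

root = lxml.html.document_fromstring(the_html_above)

# Exact match on the element's text.
exact = root.xpath('.//a[text()="TEXT A"]')
# Text contains the given substring.
partial = root.xpath('.//a[contains(text(), "TEXT A")]')
# Text starts with the given prefix.
prefix = root.xpath('.//a[starts-with(text(), "TEXT A")]')

for e in exact:
    print(e.get("href"), e.text)  # /1234.html TEXT A
```

Since xpath() returns a list, take exact[0] if you want a single element the way find() would give you, and check for an empty list first.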
Another way that looks more straightforward to me:
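One way to read this, sketched under the assumption that plain iteration is what is meant: walk the a elements with iter() and compare their text in Python, with no XPath predicate at all. The variable names here are illustrative.

```python
import lxml.html

# Sample markup from the question (closing </body> tag corrected).
the_html_above = """
<html>
<body>
<a href="/1234.html">TEXT A</a>
<a href="/3243.html">TEXT B</a>
<a href="/7445.html">TEXT C</a>
</body>
</html>
"""

root = lxml.html.document_fromstring(the_html_above)

# iter("a") yields every <a> element in document order.
# e.text is None for elements with no leading text, so guard against that.
results = [e for e in root.iter("a") if e.text and "TEXT A" in e.text]

print([e.get("href") for e in results])  # ['/1234.html']
```

This trades XPath's compactness for ordinary Python string tests, which makes it easy to swap in ==, startswith(), or a regular expression.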