Given the following (simplified from a larger document)
<tr class="row-class">
<td>Age</td>
<td>16</td>
</tr>
<tr class="row-class">
<td>Height</td>
<td>5.6</td>
</tr>
<tr class="row-class">
<td>Weight</td>
<td>103.4</td>
</tr>
I have tried to return the 16 from the appropriate row using bs4 and lxml. The issue seems to be that there is a NavigableString between the two td tags, so that
page.find_all("tr", {"class":"row-class"})
yields a result set with
result[0] = {Tag} <tr class="row-class"> <td>Age</td> <td>16</td> </tr>
result[1] = {Tag} <tr class="row-class"> <td>Height</td> <td>5.6</td> </tr>
result[2] = {Tag} <tr class="row-class"> <td>Weight</td> <td>103.4</td> </tr>
which is great, but I can't get the string in the second td. The contents of each of these rows are similar to
[' ', <td>Age</td>, ' ', <td>16</td>, ' ']
with the td being a Tag and the ' ' being a NavigableString. This difference is preventing me from using the next_element or next_sibling convenience attributes to access the correct text.
If I use:
find("td", text=re.compile(r'Age')).get_text()
I get Age. But if I try to access the next element via
find("td", text=re.compile(r'Age')).next_element()
I get
'NavigableString' object is not callable
Because of the wrapping NavigableStrings in the result, moving backwards with previous_element has the same problem.
How do I move from the found Tag to the next Tag, skipping the next_element in between? Is there a way to remove these ' ' from the result?
I should point out that I've already tried to be pragmatic with something like:
for r in sp.find_all("tr", {"class": "row-class"}):
    age = r.find("td", text=re.compile(r"\d\d")).get_text()
It works ... until I parse a document whose rows are in a different order, with a value matching \d\d before Age.
I also know that I can use
find("td", text=re.compile(r'Age')).next_sibling.next_sibling
but that hard-codes the structure.
So I need to be specific in the search and find the td that has the target string, then find the value in the next td. I know I could build some logic that tests each row, but it seems like I'm missing something obvious and more elegant...
If you get a list of elements, then you can use [index] to get an element from the list.
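As a rough sketch of that idea (not necessarily the only or best way), the snippet below collects the td tags of each row with find_all("td"), which returns only tags and so skips the whitespace NavigableStrings, and then indexes into that list. A second variant uses find_next_sibling("td"), which also steps over the whitespace strings. The helper names value_by_index and value_by_sibling, the wrapping table element, and the choice of the html.parser backend are assumptions made for illustration; string= is the newer spelling of the text= argument used in the question.

```python
import re
from bs4 import BeautifulSoup

# Sample markup from the question, wrapped in a <table> for illustration.
html = """
<table>
<tr class="row-class">
<td>Age</td>
<td>16</td>
</tr>
<tr class="row-class">
<td>Height</td>
<td>5.6</td>
</tr>
<tr class="row-class">
<td>Weight</td>
<td>103.4</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def value_by_index(soup, label):
    """Hypothetical helper: index into each row's list of td tags."""
    for row in soup.find_all("tr", {"class": "row-class"}):
        # find_all("td") returns only Tag objects, so the whitespace
        # NavigableStrings between the cells never appear in this list.
        cells = row.find_all("td")
        if len(cells) >= 2 and cells[0].get_text(strip=True) == label:
            return cells[1].get_text(strip=True)
    return None


def value_by_sibling(soup, label):
    """Hypothetical helper: find the label cell, then the next td sibling."""
    label_cell = soup.find("td", string=re.compile(re.escape(label)))
    if label_cell is None:
        return None
    # find_next_sibling("td") skips non-matching siblings, including
    # the whitespace strings between the two cells.
    value_cell = label_cell.find_next_sibling("td")
    return value_cell.get_text(strip=True) if value_cell else None


age = value_by_index(soup, "Age")
height = value_by_sibling(soup, "Height")
```

Both helpers return None when the label is not found, so a document with a different row order simply yields the matching row or nothing. Note also that next_element and next_sibling are attributes, not methods, which is why calling next_element() raises "'NavigableString' object is not callable".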