I'm trying to write lxml/XPath code to parse the HTML from this link: https://www.theice.com/productguide/ProductSpec.shtml?specId=251

Specifically, I'm trying to parse the <tr class="last"> table near the end of the page. I want to obtain the text in that sub-table, for example "New York" and the hours listed next to it (and do the same for London and Singapore).
I have the following code (which doesn't work properly):

```python
doc = lxml.html.fromstring(page)
tds = doc.xpath('//table[@class="last"]//table[@id"tradingHours"]/tbody/tr/td/text()')
```
With BeautifulSoup:

```python
table = soup.find('table', attrs={'id': 'tradingHours'})
for td in table.findChildren('td'):
    print(td.text)
```
What is the best method to achieve this? I want to use lxml rather than BeautifulSoup (just to see the difference).
I find CSS selectors much more adaptive to page changes than XPaths:
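A minimal sketch of that approach (fetching with urllib.request and the exact selector are my assumptions; lxml's `cssselect()` also requires the third-party cssselect package):

```python
import urllib.request
import lxml.html

url = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251'
doc = lxml.html.fromstring(urllib.request.urlopen(url).read())

# Select every cell of the trading-hours table via a CSS id selector.
for td in doc.cssselect('table#tradingHours td'):
    print(td.text_content().strip())
```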
If the site is proper HTML, id attributes are unique, so you can also find the table with `doc.get_element_by_id('tradingHours')`. Either way, this results in the city names and the trading hours listed next to them.
Your lxml code is very close to working. The main problem is that the `table` tag is not the one with the `class="last"` attribute. Rather, it is a `tr` tag that has that attribute (the page contains `<tr class="last">`, not `<table class="last">`). Thus, `//table[@class="last"]` has no matches. There is also a minor syntax error: `@id"tradingHours"` should be `@id="tradingHours"`.
You can also omit `//table[@class="last"]` entirely, since `table[@id="tradingHours"]` is specific enough.

The closest analog to your BeautifulSoup code would be:
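Something like this (fetching the page with urllib.request is my assumption; any way of obtaining the HTML works):

```python
import urllib.request
import lxml.html

url = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251'
doc = lxml.html.fromstring(urllib.request.urlopen(url).read())

# Same idea as findChildren('td'): print the text of each cell.
for td in doc.xpath('//table[@id="tradingHours"]/tbody/tr/td'):
    print(td.text_content())
```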
The grouper recipe, `zip(*[iterable]*n)`, is often very useful when parsing tables: it collects the items in `iterable` into groups of `n` items. I'm not terribly good at explaining how the grouper recipe works, but I've made an attempt here. We could use it here like this:
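A sketch, assuming the cells alternate city, hours, city, hours (so `n = 2`):

```python
# Repeating the same iterator inside zip pairs consecutive items,
# so each (city, hours) pair comes from one row of the table.
cells = (td.text_content().strip()
         for td in doc.xpath('//table[@id="tradingHours"]/tbody/tr/td'))
for city, hours in zip(*[cells] * 2):
    print('{}: {}'.format(city, hours))
```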
This page is using JavaScript to reformat the dates. To scrape the page after the JavaScript has altered the contents, you could use selenium:
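A minimal sketch with selenium (the Firefox driver and the fixed sleep are assumptions; an explicit wait would be more robust):

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251'
driver = webdriver.Firefox()
try:
    driver.get(url)
    time.sleep(5)  # crude wait for the page's JavaScript to finish
    # Read the cells only after the scripts have reformatted the dates.
    for td in driver.find_elements(By.CSS_SELECTOR, '#tradingHours td'):
        print(td.text)
finally:
    driver.quit()
```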
This yields the hours as they appear in the browser, after the JavaScript has reformatted them.
Note that in this particular case, if you did not want to use selenium, you could use pytz to parse and convert the times yourself:
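A sketch of that idea; the sample time string, its format, and the assumption that the page quotes New York time are all hypothetical, for illustration only:

```python
from datetime import datetime
import pytz

# Hypothetical input: a New York trading time scraped from the raw HTML.
new_york = pytz.timezone('America/New_York')
naive = datetime.strptime('2013-05-28 20:00', '%Y-%m-%d %H:%M')
ny_time = new_york.localize(naive)

# Convert the same instant into the other two cities' local times.
for zone in ('Europe/London', 'Asia/Singapore'):
    local = ny_time.astimezone(pytz.timezone(zone))
    print('{}: {}'.format(zone, local.strftime('%Y-%m-%d %H:%M')))
```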