How can I parse table data from a website using Selenium?

Posted 2019-08-26 00:05

I'm trying to parse the table on [this website][1] using Selenium. I'm a beginner and struggling to do it. Here is my code:

[1]: http://www.espncricinfo.com/rankings/content/page/211270.html

from bs4 import BeautifulSoup
import time
from selenium import webdriver

url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()

browser.get(url)
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")

print(len(soup.find_all("table")))
print(soup.find("table", {"class": "expanded_standings"}))

browser.close()
browser.quit()

That is what I tried, but I'm unable to fetch anything with it. Any suggestions would be really helpful. Thanks!

2 Answers
Rolldiameter
Answered 2019-08-26 00:36

It looks like the tables on that page are inside iframes. If there is a specific table you want to scrape, inspect it with your browser's dev tools (in Chrome: right-click, Inspect) and find the iframe element that wraps it. The iframe will have a src attribute holding the URL of the page that actually contains the table. You can then use the same method you tried, but against that src URL instead.
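As a sketch of that idea, here is how you could pull an iframe's src out of the page source with BeautifulSoup and then request that URL directly. The HTML snippet below is a made-up stand-in for the real page source, and the URL in it is an assumption, not the real one:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for browser.page_source; on the real page you
# would locate the exact iframe with dev tools first.
page_html = """
<html><body>
  <iframe name="testbat" src="http://www.example.com/rankings/table.html"></iframe>
</body></html>
"""

soup = BeautifulSoup(page_html, "html.parser")
iframe = soup.find("iframe", {"name": "testbat"})
table_url = iframe["src"]  # the page that actually contains the table
print(table_url)  # -> http://www.example.com/rankings/table.html
# then: requests.get(table_url) and parse the table from that response
```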

Selenium can also "jump into" an iframe if you can locate the iframe in the page source:

frame = browser.find_element_by_id("the_iframe_id")  # substitute the actual iframe's id
browser.switch_to.frame(frame)
html = browser.page_source

霸刀☆藐视天下
Answered 2019-08-26 00:45

The table you are after is inside an iframe, so to get the data from it you need to switch to that iframe first and then do the rest. Here is one way you could do it:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
wait = WebDriverWait(driver, 10)
# If you expect a different table, change the index number within nth-of-type()
# and the name in the selector accordingly.
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[name='testbat']:nth-of-type(1)")))
for row in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tr")))[1:]:
    data = [cell.text for cell in row.find_elements_by_css_selector("th,td")]
    print(data)
driver.quit()

In this particular case, though, the best approach is the one below: no browser simulator at all, only requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
soup = BeautifulSoup(res.text,"lxml")
# If you expect a different table, change the index number
# and the name in the selector accordingly.
iframe_src = soup.select("iframe[name='testbat']")[0]['src']
req = requests.get(iframe_src)
sauce = BeautifulSoup(req.text, "lxml")
for row in sauce.select("table tr"):
    data = [cell.text for cell in row.select("th,td")]
    print(data)

Partial results:

['Rank', 'Name', 'Country', 'Rating']
['1', 'S.P.D. Smith', 'AUS', '947']
['2', 'V. Kohli', 'IND', '912']
['3', 'J.E. Root', 'ENG', '881']