I am trying to scrape tables from the following webpage using BeautifulSoup:
https://www.pro-football-reference.com/boxscores/201702050atl.htm
import requests
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com/boxscores/201702050atl.htm'
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html, 'lxml')
Most of the tables on the page are inside comment tags, so they can't be accessed in a straightforward way.
print(soup.table.text)
returns:
1
2
3
4
OT
Final
via Sports Logos.net
About logos
New England Patriots
0
3
6
19
6
34
via Sports Logos.net
About logos
Atlanta Falcons
0
21
7
0
0
28
i.e. the main tables containing the player stats are missing. I have tried simply stripping the comment markers with

html = html.replace('<!--', '')
html = html.replace('-->', '')

but to no avail. How can I access these commented-out tables?
Here you go. You can get any table from that page just by changing the index number.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm').text
soup = BeautifulSoup(page, 'lxml')

table = soup.find_all('table')[1]  # index of the table you want; change it to get a different table
tab_data = [[celldata.text for celldata in rowdata.find_all(['th', 'td'])]
            for rowdata in table.find_all('tr')]
for data in tab_data:
    print(' '.join(data))
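If you're not sure which index corresponds to which table, one quick way (a small sketch along the same lines) is to print the position and id of every table BeautifulSoup can actually see:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm').text
soup = BeautifulSoup(page, 'lxml')

# List the index and id attribute of each table that is not commented out,
# so you know which number to pass to find_all('table')[...]
for i, table in enumerate(soup.find_all('table')):
    print(i, table.get('id'))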
All of the tables except the first two are inside HTML comments that the page's JavaScript un-comments in a real browser, which is why rendering the page with Selenium lets you parse them. You will be able to access any table from that page now. Here is the modified version.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm')
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

table = soup.find_all('table')[7]  # index of the table you want; change it to get a different table
tab_data = [[celldata.text for celldata in rowdata.find_all(['th', 'td'])]
            for rowdata in table.find_all('tr')]
for data in tab_data:
    print(' '.join(data))
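If you'd rather not have a browser window pop up while this runs, Chrome can be started headless. A sketch (the exact headless flag can vary with your Chrome/Selenium versions):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless')  # render the page without opening a window

driver = webdriver.Chrome(options=options)
driver.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm')
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()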
I'm able to parse the tables using Beautiful Soup and pandas; here is some code to help you out.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.pro-football-reference.com/boxscores/201702050atl.htm'
page = requests.get(url)
soup = BeautifulSoup(page.content,'lxml')
# Find the second table on the page
t = soup.find_all('table')[1]
# Read the table into a Pandas DataFrame
df = pd.read_html(str(t))[0]
df
now contains this:
Quarter Time Tm Detail NWE ATL
0 2 12:15 Falcons Devonta Freeman 5 yard rush (Matt Bryant kick) 0 7
1 NaN 8:48 Falcons Austin Hooper 19 yard pass from Matt Ryan (Mat... 0 14
2 NaN 2:21 Falcons Robert Alford 82 yard interception return (Mat... 0 21
3 NaN 0:02 Patriots Stephen Gostkowski 41 yard field goal 3 21
4 3 8:31 Falcons Tevin Coleman 6 yard pass from Matt Ryan (Matt... 3 28
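As an aside, pandas can parse every table it finds in one call, so you don't have to index into the soup first. A sketch (note that read_html still only sees the tables that are not commented out):

import requests
import pandas as pd

url = 'https://www.pro-football-reference.com/boxscores/201702050atl.htm'
html = requests.get(url).text

# read_html returns one DataFrame per <table> in the static HTML
dfs = pd.read_html(html)
print(len(dfs))  # only the visible tables, e.g. the linescore and scoring table
df = dfs[1]      # same scoring table as above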
In case anyone else is interested in grabbing tables from comments without using Selenium: you can grab all the comments, check whether each one contains a table, and pass that text back to BeautifulSoup to parse.
import requests
from bs4 import BeautifulSoup, Comment

r = requests.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm')
if r.status_code == 200:
    soup = BeautifulSoup(r.content, 'html.parser')
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # 'in' avoids the str.find() pitfall where a table at position 0
        # would fail a '> 0' check
        if '<table' in comment:
            comment_soup = BeautifulSoup(comment, 'html.parser')
            table = comment_soup.find('table')
It would probably be wise to make this a little more robust, e.g. by ensuring the entire table exists within the same comment.
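For example, one way to tighten it up (a sketch, building on the snippet above): only hand a comment to the parser when it contains a complete table, and collect everything into pandas DataFrames keyed by each table's id attribute:

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

r = requests.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm')
soup = BeautifulSoup(r.content, 'html.parser')

tables = {}
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    # Require both the opening and closing tag, so comments holding
    # only a fragment of a table are skipped
    if '<table' in comment and '</table>' in comment:
        comment_soup = BeautifulSoup(comment, 'html.parser')
        for table in comment_soup.find_all('table'):
            key = table.get('id') or len(tables)
            tables[key] = pd.read_html(str(table))[0]

print(list(tables))  # the id of every commented-out table that was found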