-->

Using BeautifulSoup to scrape tables within commen

2020-08-09 04:14发布

问题:

I am trying to scrape tables from the following webpage using BeautifulSoup: https://www.pro-football-reference.com/boxscores/201702050atl.htm

import requests
from bs4 import BeautifulSoup

url = 'https://www.pro-football-
reference.com/boxscores/201702050atl.htm'
page = requests.get(url)
html = page.text

Most of the tables on the page are inside comment tags, so can't be accessed in a straightforward way.

print(soup.table.text)

returns:

1
2
3
4
OT
Final







via Sports Logos.net
About logos


New England Patriots
0
3
6
19 
6
34





via Sports Logos.net
About logos


Atlanta Falcons
0
21
7
0
0
28

i.e. the main tables containing the player stats are missing. I have tried to simply remove the comment tags using

html = html.replace('<!--',"")
html = html.replace('-->',"")

but to no avail. How can I access these commented-out tables?

回答1:

Here you go. You can get any table from that page only changing the index number.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm').text

soup = BeautifulSoup(page,'lxml')
table = soup.find_all('table')[1]  #This is the index of any table of that page. If you change it you can get different tables.
tab_data = [[celldata.text for celldata in rowdata.find_all(["th","td"])]
                        for rowdata in table.find_all("tr")]
for data in tab_data:
    print(' '.join(data))

As the other tables except for the first two are within javascript, that is why you need to use selenium to gatecrash and parse them. You will definitely be able to access any table from that page now. Here is the modified one.

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm')
soup = BeautifulSoup(driver.page_source,'lxml')
driver.quit()
table = soup.find_all('table')[7]  #This is the index of any table of that page. If you change it you can get different tables.
tab_data = [[celldata.text for celldata in rowdata.find_all(["th","td"])]
                        for rowdata in table.find_all("tr")]
for data in tab_data:
    print(' '.join(data))


回答2:

I'm able to parse the tables using Beautiful Soup and Pandas, here is some code to help you out.

import requests
from bs4 import BeautifulSoup
import pandas as pd    

url = 'https://www.pro-football-reference.com/boxscores/201702050atl.htm'
page = requests.get(url)

soup = BeautifulSoup(page.content,'lxml')
# Find the second table on the page
t = soup.find_all('table')[1]
# Read the table into a Pandas DataFrame
df = pd.read_html(str(t))[0]

df now contains this:

    Quarter Time    Tm  Detail  NWE ATL
0   2   12:15   Falcons Devonta Freeman 5 yard rush (Matt Bryant kick)  0   7
1   NaN 8:48    Falcons Austin Hooper 19 yard pass from Matt Ryan (Mat...   0   14
2   NaN 2:21    Falcons Robert Alford 82 yard interception return (Mat...   0   21
3   NaN 0:02    Patriots    Stephen Gostkowski 41 yard field goal   3   21
4   3   8:31    Falcons Tevin Coleman 6 yard pass from Matt Ryan (Matt...   3   28


回答3:

In case anyone else is interested in grabbing tables from comments without using selenium.

You can grab all the comments, then check if a table is present and pass that text back to BeautifulSoup to parse the table.

import requests
from bs4 import BeautifulSoup, Comment

r = requests.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm')

if r.status_code == 200:
    soup = BeautifulSoup(r.content, 'html.parser')

    for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
        if comment.find("<table ") > 0:
            comment_soup = BeautifulSoup(comment, 'html.parser')
            table = comment_soup.find("table")

Would probably be wise to make this a little more robust to ensure the entire table exists within the same comment.