I am trying to scrape the historical weather data from the Weather Underground page "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html". I have the following code:
import pandas as pd
page_link = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
df = pd.read_html(page_link)
print(df)
I get the following error:
Traceback (most recent call last):
  File "weather_station_scrapping.py", line 11, in <module>
    result = pd.read_html(page_link)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
    displayed_only=displayed_only)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse
    raise_with_traceback(retained)
  File "/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 403, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No tables found
This page clearly has a table, but read_html is not picking it up. I have tried using Selenium so that the page can load before I read it:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html")
elem = driver.find_element_by_id("history_table")
head = elem.find_element_by_tag_name('thead')
body = elem.find_element_by_tag_name('tbody')
list_rows = []
for items in body.find_element_by_tag_name('tr'):
    list_cells = []
    for item in items.find_elements_by_tag_name('td'):
        list_cells.append(item.text)
    list_rows.append(list_cells)
driver.close()
Now, the problem is that it cannot find "tr". I would appreciate any suggestions.
You can use the requests library and avoid opening a browser. You can get the current conditions by using:
https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15
Then strip 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right, and handle the remaining JSON string.

You can get the summary and history by calling the API with the following URL:
https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json?callback=jQuery1720724027235122559_1542743885015&_=1542743886276
You then need to strip 'jQuery1720724027235122559_1542743885015(' from the front and ');' from the end, which leaves a JSON string you can parse.
You can find these URLs by opening the F12 dev tools in your browser and inspecting the network tab for the traffic created during page load.
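The stripping step described above could look roughly like the following. This is only a sketch: the helper names strip_jsonp and fetch_station_json are made up for illustration, and it assumes the endpoint still answers with a callback-wrapped (JSONP) body of the form callbackName({...}); as noted above.

```python
import json
import re


def strip_jsonp(text):
    """Remove a 'callbackName( ... );' wrapper and parse the JSON inside."""
    # Everything up to the first "(" is the callback name; the payload ends
    # at the last ")" which may be followed by an optional ";".
    match = re.match(r"^[^(]*\((.*)\)\s*;?\s*$", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("response does not look like a JSONP payload")
    # json.loads turns JSON null into Python None.
    return json.loads(match.group(1))


def fetch_station_json(url):
    """Hypothetical helper: download a JSONP response and return the parsed data."""
    # Imported here so strip_jsonp can be used without requests installed.
    import requests

    response = requests.get(url)
    response.raise_for_status()
    return strip_jsonp(response.text)
```

Calling fetch_station_json with one of the URLs above should then give a plain Python dict, provided the service still responds in that format.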
For the current conditions, note that there seems to be a problem with nulls in the JSON, so I am replacing them with "placeholder".

Here's a solution using selenium for browser automation.
Edit: here is a breakdown of exactly what's happening, since the above one-liner is not very self-documenting:
After setting up the driver, we select the table by its ID value (thankfully, this site actually uses reasonable and descriptive IDs).
Then, from that element, we get the HTML instead of the web driver element object.
We use pandas to parse the HTML.
From the docs, read_html returns a list of DataFrames.
So we index into that list to get the only table we have, at index zero.
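Put together, the steps above might be sketched like this. The function names are made up for illustration, the table ID history_table is taken from the question, and the explicit wait is an assumption on my part (that the rows are filled in by JavaScript after the page loads), so adjust as needed.

```python
from io import StringIO

import pandas as pd


def table_html_to_frame(table_html):
    """Parse the HTML of a single <table> element into a DataFrame."""
    # read_html returns a list of DataFrames; one table gives one entry.
    return pd.read_html(StringIO(table_html))[0]


def scrape_history_table(url, table_id="history_table", timeout=30):
    """Hypothetical helper: load the page in Firefox and return the table."""
    # Selenium is imported here so the parsing helper above works on its own.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Assumption: rows appear only after scripts run, so wait for the first one.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, f"#{table_id} tbody tr")
            )
        )
        # Select the table by its ID and take its HTML rather than the element object.
        table = driver.find_element(By.ID, table_id)
        return table_html_to_frame(table.get_attribute("outerHTML"))
    finally:
        driver.quit()
```

Wrapping the HTML in StringIO keeps newer pandas versions from warning about literal HTML strings, and the try/finally makes sure the browser is closed even if the wait times out.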