I'm just trying to scrape data from a Wikipedia table into a pandas DataFrame.
I need to reproduce the three columns: "Postcode, Borough, Neighbourhood".
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'xml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))
print (Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighbourhood'] = pd.Series(Neighbourhood)
df
And it returns only the borough...
Thanks
You may be overthinking the problem if you only want the script to pull one table from the page. One import, one line, no loops:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url, header=0)[0]
df.head()
Postcode Borough Neighbourhood
0 M1A Not assigned Not assigned
1 M2A Not assigned Not assigned
2 M3A North York Parkwoods
3 M4A North York Victoria Village
4 M5A Downtown Toronto Harbourfront
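If you want the column label to match the PostalCode name from your own attempt, a rename afterwards is enough (a minimal follow-up sketch, assuming the column comes back as Postcode as shown above):
# rename the Postcode column to match the name used in the question
df = df.rename(columns={'Postcode': 'PostalCode'})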
Your version flattens every link title in the table into one long list, so the row structure is lost. You need to iterate over each row of the table and store the data row by row, not in one giant list. Try something like this:
import pandas
import requests
from bs4 import BeautifulSoup
website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text,'xml')
table = soup.find('table',{'class':'wikitable sortable'})
table_rows = table.find_all('tr')
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])
df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df = df[~df['PostalCode'].isnull()]  # drop the header row, which has no <td> cells and comes through as all-NaN
then
>>> df.head()
PostalCode Borough Neighbourhood
1 M1A Not assigned Not assigned
2 M2A Not assigned Not assigned
3 M3A North York Parkwoods
4 M4A North York Victoria Village
5 M5A Downtown Toronto Harbourfront
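If you'd prefer not to hard-code the column names, you could also read them from the header row's th cells; a small variation on the same loop (a sketch, using the same variables as the code above):
# the first tr holds the header cells: ['Postcode', 'Borough', 'Neighbourhood']
header = [th.text.strip() for th in table_rows[0].find_all('th')]
# the remaining rows hold the data cells
rows = [[td.text.strip() for td in row.find_all('td')] for row in table_rows[1:]]
df = pandas.DataFrame(rows, columns=header)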
Basedig provides a platform for downloading Wikipedia tables directly as Excel, CSV or JSON files. Here is a link to their Wikipedia section: https://www.basedig.com/wikipedia/
If you do not find the dataset you are looking for on Basedig, send them the link to your article and they will parse it for you.
Hope this helps