Scraping data from a Wikipedia table

Posted 2020-08-08 05:26


I'm just trying to scrape data from a Wikipedia table into a pandas DataFrame.

I need to reproduce the three columns: "Postcode, Borough, Neighbourhood".

import requests
website_url = requests.get('').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'xml')

My_table = soup.find('table',{'class':'wikitable sortable'})

links = My_table.findAll('a')

Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))

print (Neighbourhood)

import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighbourhood'] = pd.Series(Neighbourhood)


And it returns only the borough...



You may be overthinking the problem if you only want the script to pull one table from the page. One import, one line, no loops:

import pandas as pd

# url is the address of the Wikipedia page; [0] selects the first table found on it
df = pd.read_html(url, header=0)[0]


    Postcode    Borough             Neighbourhood
0   M1A         Not assigned        Not assigned
1   M2A         Not assigned        Not assigned
2   M3A         North York          Parkwoods
3   M4A         North York          Victoria Village
4   M5A         Downtown Toronto    Harbourfront
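
If the page holds more than one table, the `[0]` index may not land on the one you want. Here is a minimal sketch of one way around that, assuming a small made-up HTML snippet in place of the real page (the first table and its contents are invented for illustration) and that a parser backend such as lxml is installed. The `match` argument of `pd.read_html` keeps only the tables whose text matches the given string:

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: two tables, only one wanted.
html = """
<table><tr><th>Key</th><th>Value</th></tr><tr><td>a</td><td>1</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# match keeps only tables whose text contains "Postcode";
# header=0 uses the first row as the column names.
tables = pd.read_html(StringIO(html), match="Postcode", header=0)
df = tables[0]
print(df)
```

Wrapping the string in `StringIO` is only needed here because the HTML is inline; with a real page you would pass the URL as in the one-liner above.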


You need to iterate over each row in the table and store the data row by row, not just in one giant list. Try something like this:

import pandas
import requests
from bs4 import BeautifulSoup
website_text = requests.get('').text
soup = BeautifulSoup(website_text,'xml')

table = soup.find('table',{'class':'wikitable sortable'})
table_rows = table.find_all('tr')

data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])

df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df = df[~df['PostalCode'].isnull()]  # to filter out bad rows


>>> df.head()

  PostalCode           Borough     Neighbourhood
1        M1A      Not assigned      Not assigned
2        M2A      Not assigned      Not assigned
3        M3A        North York         Parkwoods
4        M4A        North York  Victoria Village
5        M5A  Downtown Toronto      Harbourfront
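
The "bad row" filtered out above is most likely the header row, which contains `th` cells rather than `td` cells and so produces an empty list. As a variation, here is a rough sketch that reads the column names from that header row instead of hard-coding them. It runs on a small made-up HTML snippet (an assumption for illustration, not the real page) and uses the built-in `html.parser` backend:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page.
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

header = []
data = []
for row in table.find_all("tr"):
    ths = [cell.get_text(strip=True) for cell in row.find_all("th")]
    tds = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if ths and not tds:
        header = ths          # header row: only <th> cells
    elif tds:
        data.append(tds)      # data row: one list per table row

df = pd.DataFrame(data, columns=header)
print(df)
```

Because the header row never reaches `data`, no null-row filter is needed afterwards.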


Basedig provides a platform to download Wikipedia tables as Excel, CSV or JSON files directly. Here is a link to the Wikipedia source:

If you do not find the dataset you are looking for on Basedig, send them the link to your article and they'll parse it for you. Hope this helps.