Scraping text in h3 and p tags using BeautifulSoup

Posted 2019-07-21 12:55

Question:

I have experience with Python and BeautifulSoup, and I want to scrape data from a website and store it as a CSV file. A sample of the markup I need to parse is shown below (each h3 plus its p makes up a single row of data).

...body and not nested divs...
<h3 class="college">
    <span class="num">1.</span> <a href="https://www.stanford.edu/">Stanford University</a>
</h3>
<div class="he-mod" data-block="paragraph-9"></div>
<p class="school-location">Stanford, CA</p>
...body and not nested divs...
<h3 id="MIT" class="college">
    <span class="num">2.</span> <a href="https://web.mit.edu/">Massachusetts Institute of Technology (MIT)</a>
</h3>
<div class="he-mod" data-block="paragraph-14"></div>
<p class="school-location">Cambridge, MA</p>
...body and not nested divs...
<h3 id="Berkeley" class="college">
    <span class="num">3.</span> <a href="https://www.berkeley.edu/">University of California Berkeley</a>
</h3>
<div class="he-mod" data-block="paragraph-19"></div>
<p class="school-location">Berkeley, CA</p>
...body and not nested divs...

I want to get the link and name within each h3, and also the text within each p tag (I can do the latter but not the former). However, with my code I'm only able to get Stanford, even though I'm using find_all(class_='colleges').

My Code

import requests
from bs4 import BeautifulSoup



page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')


college_name_list = soup.find(class_='college')
college_name_list_items = college_name_list.find_all('a')
for college_name in college_name_list_items:
    print(college_name.prettify())

Output

<a href="https://www.stanford.edu/">
    Stanford University
   </a>

I want to get the other colleges too; they have the same class="college" but different ids.

Please help me just extract them; I can arrange the .csv myself.

Source website to be scraped: if you can, please tell me which div, class, or anything else I should look at!

Answer 1:

Try using find_all with the <h3> tag, then find the <a> inside each heading and extract its text and href value.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')

college_name=[]

college_name_list = soup.find_all('h3',class_='college')
for college in college_name_list:
  if college.find('a'):
    college_name.append(college.find('a')['href'])
    college_name.append(college.find('a').text)

print(college_name)

Output, as a flat list that alternates URL and name:

['https://www.stanford.edu/', 'Stanford University', 'https://web.mit.edu/', 'Massachusetts Institute of Technology (MIT)', 'https://www.berkeley.edu/', 'University of California Berkeley', 'https://www.harvard.edu/', 'Harvard University', 'https://www.princeton.edu/', 'Princeton University', '/pennsylvania-education/carnegie-mellon-university-online/', 'Carnegie Mellon University', 'https://www.utexas.edu/', 'The University of Texas at Austin', 'https://www.cornell.edu/', 'Cornell University', 'https://www.ucla.edu/', 'University of California, Los Angeles (UCLA)', '/california-education/university-southern-california-online/', 'University of Southern California', 'https://www.caltech.edu/', 'California Institute of Technology (Caltech)', 'https://www.utoronto.ca/', 'University of Toronto', 'https://illinois.edu/', 'University of Illinois at Urbana-Champaign', 'https://ucsd.edu/', 'University of California in San Diego', 'https://www.umich.edu/', 'University of Michigan', 'https://www.umd.edu/', 'University of Maryland, College Park', 'https://www.ethz.ch/en.html', 'Swiss Federal Institute of Technology', 'https://www.technion.ac.il/en/home-2/', 'Technion-Israel Institute of Technology', 'https://www.osu.edu/', 'Ohio State University', 'https://english.tau.ac.il/', 'Tel Aviv University', '/indiana-education/purdue-university-online/', 'Purdue University', 'https://www.gatech.edu/', 'Georgia Institute of Technology', 'https://www.cam.ac.uk/', 'University of Cambridge', 'https://www.ntu.edu.tw/english/', 'National Taiwan University', 'http://ac.cs.tsinghua.edu.cn', 'Tsinghua University', 'https://www.imperial.ac.uk/', 'The Imperial College of Science, Technology, and Medicine', 'https://www.kau.edu.sa/home_english.aspx', 'King Abdulaziz University', 'https://www.tum.de/en/homepage/', 'Technical University Munich', 'https://uci.edu/', 'University of California, Irvine', 'https://www.ucdavis.edu/', 'University of California, Davis', 'https://www.columbia.edu/', 
'Columbia University', '/online-colleges/arizona-state-university-online/', 'Arizona State University', 'https://www.ntu.edu.sg/Pages/home.aspx', 'Nanyang Technological University', 'https://www.ox.ac.uk/', 'University of Oxford', '/online-colleges/northwestern-university-online/', 'Northwestern University', 'https://www.epfl.ch/en/home/', 'Swiss Federal Institute of Technology Lausanne', 'https://www.nyu.edu/', 'New York University', 'https://www.kau.edu.sa/home_english.aspx', 'The Chinese University of Hong Kong', '/north-carolina-education/university-north-carolina-online/', 'University of North Carolina at Chapel Hill', 'https://www.ust.hk/', 'The Hong Kong University of Science and Technology', 'https://twin-cities.umn.edu/', 'University of Minnesota, Twin Cities', 'https://www.zju.edu.cn/english/', 'Zhejiang University', 'https://www.ugr.es/en/', 'University of Granada', 'https://www.ucl.ac.uk/', 'University College London', 'https://www.cityu.edu.hk/', 'City University of Hong Kong', 'https://www.ubc.ca/', 'University of British Columbia', 'https://www.nd.edu/', 'University of Notre Dame', 'http://www.nus.edu.sg/', 'The National University of Singapore', 'http://en.sjtu.edu.cn/', 'Shanghai Jiao Tong University', 'https://www.yale.edu/', 'Yale University', 'https://www.washington.edu/', 'University of Washington', '/north-carolina-education/duke-university-online/', 'Duke University', 'https://www.colorado.edu/', 'University of Colorado at Boulder', 'https://www.ku.dk/english/', 'University of Copenhagen', 'https://www.ucsb.edu/', 'University of California, Santa Barbara', 'https://www.manchester.ac.uk/', 'University of Manchester', 'https://newbrunswick.rutgers.edu/', 'Rutgers University', 'https://www.rice.edu/', 'Rice University', 'https://www.kuleuven.be/english/', 'KU Leuven', 'https://www.utah.edu/', 'University of Utah', 'https://msu.edu/', 'Michigan State University', 'https://www.tamu.edu/', 'Texas A&M University', 'http://english.pku.edu.cn/', 
'Peking University', 'https://www.psu.edu/', 'Pennsylvania State University - University Park', 'https://www.udel.edu/', 'University of Delaware', 'http://en.xjtu.edu.cn/', 'Xian Jiao Tong University', 'http://english.hust.edu.cn/', 'Huazhong University of Science and Technology', 'http://en.hit.edu.cn/', 'Harbin Institute of Technology', 'https://www.sfu.ca/', 'Simon Fraser University', 'https://www.polyu.edu.hk/web/en/home/', 'The Hong Kong Polytechnic University', 'https://www.tue.nl/en/', 'Eindhoven University of Technology', 'https://www.nctu.edu.tw/index.php/en', 'National Chiao Tung University', 'https://en.xidian.edu.cn/', 'Xidian University', 'https://www.ujaen.es/serv/vicint/home/index', 'University of Jaen', 'https://www.kaust.edu.sa/en', 'King Abdullah University of Science and Technology', 'https://www.jhu.edu/', 'Johns Hopkins University', 'https://www.upenn.edu/', 'University of Pennsylvania', 'https://www.wisc.edu/', 'University of Wisconsin', 'https://www.ed.ac.uk/home', 'The University of Edinburgh', 'https://www.mcgill.ca/', 'McGill University', 'https://www.bristol.ac.uk/', 'University of Bristol', 'https://new.huji.ac.il/en', 'The Hebrew University of Jerusalem', 'https://www.ugent.be/en', 'Ghent University', 'https://www.brown.edu/', 'Brown University', 'https://www.weizmann.ac.il/pages/', 'Weizmann Institute of Science', 'https://www.unsw.edu.au/', 'University of New South Wales', 'https://www.ualberta.ca/', 'University of Alberta', 'https://www.southampton.ac.uk/', 'University of Southampton', 'https://www.dtu.dk/english', 'Technical University of Denmark', 'https://en.uniroma1.it/', 'Sapienza University of Rome', 'https://en.ustc.edu.cn/', 'The University of Science and Technology of China', 'https://www.uic.edu/', 'University of Illinois at Chicago', 'https://www.hku.hk/', 'University of Hong Kong', 'https://uwaterloo.ca/', 'University of Waterloo', 'https://www.kaist.edu/html/en/', 'Korea Advanced Institute of Science and Technology', 
'https://www.uh.edu/', 'University of Houston', 'http://en.dlut.edu.cn/', 'Dalian University of Technology', 'https://en.whu.edu.cn/', 'Wuhan University', '/online-colleges/new-jersey-institute-technology-online/', 'New Jersey Institute of Technology']
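
Some of the href values in the output above are site-relative (for example '/pennsylvania-education/carnegie-mellon-university-online/'). If absolute URLs are wanted in the CSV, one option is to resolve each href against the page URL with the standard library's urljoin. This is a minimal sketch; the helper name absolutize is an assumption made for illustration, not part of the original answer.

```python
from urllib.parse import urljoin

# Page the links were scraped from (the URL used in the question).
BASE_URL = 'https://thebestschools.org/features/best-computer-science-programs-in-the-world/'


def absolutize(href, base=BASE_URL):
    """Resolve a possibly relative href against the page URL.

    absolutize is a hypothetical helper name; hrefs that are already
    absolute are returned unchanged by urljoin.
    """
    return urljoin(base, href)


print(absolutize('/pennsylvania-education/carnegie-mellon-university-online/'))
print(absolutize('https://www.stanford.edu/'))
```

In the loop above, this would mean appending absolutize(college.find('a')['href']) instead of the raw href.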

Alternatively, you can use a pandas DataFrame to export all the data to CSV.

To install pandas, run this from the command line:

pip install pandas

Then use the code below.

import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')
college_name=[]
college_name_url=[]
college_name_list = soup.find_all('h3',class_='college')
for college in college_name_list:
  if college.find('a'):
    college_name_url.append(college.find('a')['href'])
    college_name.append(college.find('a').text)
df = pd.DataFrame({"college_name":college_name,"college_name_url":college_name_url})
df.to_csv('college_name.csv')

Your CSV file will contain an index column followed by the college_name and college_name_url columns.
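
The question also asks for the location text in the p tag that follows each heading. The sketch below shows one way to pair each h3 with its location, assuming the p.school-location element is a later sibling of the h3 as in the sample markup. It runs on a trimmed copy of that sample rather than the live page, and the names SAMPLE_HTML and extract_rows are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, so the sketch runs offline.
SAMPLE_HTML = """
<h3 class="college">
  <span class="num">1.</span> <a href="https://www.stanford.edu/">Stanford University</a>
</h3>
<div class="he-mod" data-block="paragraph-9"></div>
<p class="school-location">Stanford, CA</p>
<h3 id="MIT" class="college">
  <span class="num">2.</span> <a href="https://web.mit.edu/">Massachusetts Institute of Technology (MIT)</a>
</h3>
<div class="he-mod" data-block="paragraph-14"></div>
<p class="school-location">Cambridge, MA</p>
"""


def extract_rows(html):
    """Return one dict per college heading (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for heading in soup.find_all('h3', class_='college'):
        link = heading.find('a')
        if link is None:
            continue  # skip headings that carry no link
        # Assumption: the location paragraph is a later sibling of the h3.
        location = heading.find_next_sibling('p', class_='school-location')
        rows.append({
            'name': link.get_text(strip=True),
            'url': link.get('href'),
            'location': location.get_text(strip=True) if location else '',
        })
    return rows


for row in extract_rows(SAMPLE_HTML):
    print(row)
```

A list of dicts like this can be passed straight to pd.DataFrame(rows) before calling to_csv.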



Answer 2:

Please try this code:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')
college_name_list = soup.find_all(class_='college')
college_name_list_items =[]
for i in college_name_list:
    college_name_list_items.append(i.find_all('a'))


for college_name in college_name_list_items:
    print(college_name)
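
Because find_all('a') returns a list for every heading, the loop above prints one list of tags per college. If a flat list of (name, href) pairs is preferred, a CSS selector is one possible alternative. The sketch below runs on a shortened copy of the sample markup; the "Unrelated heading" line is made up here to show that headings without the college class are ignored.

```python
from bs4 import BeautifulSoup

# Shortened sample markup; the middle h3 is a made-up, non-matching heading.
html = """
<h3 class="college"><span class="num">1.</span> <a href="https://www.stanford.edu/">Stanford University</a></h3>
<h3>Unrelated heading <a href="/about/">About</a></h3>
<h3 id="MIT" class="college"><span class="num">2.</span> <a href="https://web.mit.edu/">Massachusetts Institute of Technology (MIT)</a></h3>
"""

soup = BeautifulSoup(html, 'html.parser')

# 'h3.college a' matches every <a> nested inside an <h3 class="college">.
pairs = [(a.get_text(strip=True), a.get('href')) for a in soup.select('h3.college a')]

for name, href in pairs:
    print(name, href)
```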


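Answer 3:

You need to get the h3 tags with class="college":

import requests
from bs4 import BeautifulSoup

list_colleges = {}

# Fetch the ranking page from the question
result = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
if result.status_code == 200:
    soup = BeautifulSoup(result.content, 'html.parser')
    colleges = soup.find_all('h3', {'class': 'college'})
    for college in colleges:
        id_college = college.get('id')
        # Headings without an id (such as Stanford in the sample) are skipped
        if id_college is not None:
            list_colleges[id_college] = college  # Store the h3 tag, keyed by its id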