Deep parse with BeautifulSoup

Posted 2019-04-16 19:57

I am trying to parse https://www.drugbank.ca/drugs. The idea is to extract all the drug names and some additional information for each drug. As you can see, each webpage presents a table of drug names, and when we click a drug name we can access that drug's information. Let's say I will keep the following code to handle the pagination:

import requests
from bs4 import BeautifulSoup

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        #data = soup.select('name-head a')
        #for link in data:
        #    href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        #    pages_data(href)

        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

drug_data()

The issue is that on each page, and for each drug in that page's table, I need to capture: Name, Accession Number, Structured Indications, and Generic Prescription Products.

I used the classic requests/BeautifulSoup approach but can't go any deeper.

Some help, please.

2 Answers
Lonely孤独者°
Answered 2019-04-16 20:19

To crawl effectively, you'll want to implement a few measures, such as maintaining a queue of URLs to visit and keeping track of which URLs you have already visited.

Keeping in mind that links can be absolute or relative and that redirects are very likely, you also probably want to construct the URLs dynamically rather than by string concatenation.
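For example, urljoin from the standard library resolves both relative and absolute hrefs against the current page (the DrugBank URLs here are just illustrative):

from urllib.parse import urljoin

base = 'https://www.drugbank.ca/drugs?page=2'
print(urljoin(base, '/drugs/DB00001'))         # relative path -> https://www.drugbank.ca/drugs/DB00001
print(urljoin(base, 'https://example.com/x'))  # absolute hrefs are kept as-is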

Here is a generic crawling workflow (using example.com, as we usually do for examples on SO)...

from urllib.parse import urljoin, urlparse  # python3
# from urlparse import urljoin, urlparse   # legacy python2
import requests
from bs4 import BeautifulSoup

def process_page(soup):
    '''data extraction process'''
    pass

def is_external(link, base='example.com'):
    '''determine if the link is external to base'''
    site = urlparse(link).netloc
    return base not in site

def resolve_link(current_location, href):
    '''resolves final location of a link including redirects'''
    req_loc = urljoin(current_location, href)
    response = requests.head(req_loc, allow_redirects=True)  # HEAD does not follow redirects by default
    resolved_location = response.url # location after redirects
    # if you don't want to visit external links...
    if is_external(resolved_location):
        return None
    return resolved_location

url_queue = ['https://example.com']
visited = set()
while url_queue:
    url = url_queue.pop() # take the next url off the queue
    response = requests.get(url)
    current_location = response.url # final location after redirects
    visited.add(url) # note that we've visited the given url
    visited.add(current_location) # and the final location
    soup = BeautifulSoup(response.text, 'lxml')
    process_page(soup) # scrape the page
    link_tags = soup.find_all('a') # gather additional links
    for anchor in link_tags:
        href = anchor.get('href')
        if not href: # skip anchors without an href attribute
            continue
        link_location = resolve_link(current_location, href)
        if link_location and link_location not in visited:
            url_queue.append(link_location)
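One design note: list.pop() removes the last item, so this crawls depth-first. If you prefer breadth-first order, a collections.deque is a drop-in alternative (a minimal sketch):

from collections import deque

url_queue = deque(['https://example.com'])
while url_queue:
    url = url_queue.popleft()  # FIFO instead of LIFO: breadth-first crawl
    # ... fetch and process `url`, appending new links with url_queue.append(...)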
霸刀☆藐视天下
Answered 2019-04-16 20:21

Create a function that uses requests and BeautifulSoup to get the data from each subpage:

import requests
from bs4 import BeautifulSoup

def get_details(url):
    print('details:', url)

    # get subpage
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")

    # get data on subpage
    dts = soup.findAll('dt')
    dds = soup.findAll('dd')

    # display details
    for dt, dd in zip(dts, dds):
        print(dt.text)
        print(dd.text)
        print('---')

    print('---------------------------')

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        # get links to subpages
        links = soup.select('strong a')
        for link in links:
            # execute function to get subpage
            get_details('https://www.drugbank.ca' + link['href'])

        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

drug_data()
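If you only need the four fields from the question, you can filter the dt/dd pairs by their labels. A minimal sketch, assuming the dt texts on drugbank.ca match the question's field names exactly (they may differ slightly on the live site):

WANTED = {'Name', 'Accession Number', 'Structured Indications',
          'Generic Prescription Products'}

def get_selected_details(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    details = {}
    # pair each term (dt) with its definition (dd) and keep only wanted fields
    for dt, dd in zip(soup.findAll('dt'), soup.findAll('dd')):
        label = dt.text.strip()
        if label in WANTED:
            details[label] = dd.text.strip()
    return details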