I try to parse https://www.drugbank.ca/drugs. The idea is to extract all the drug names and some additional informationfor each drug. As you can see each webpage represents a table with drug names and the when we hit the drugname we can access to this drug information. Let's say I will keep the following code to handle the pagination:
import requests
from bs4 import BeautifulSoup
def drug_data():
url = 'https://www.drugbank.ca/drugs/'
while url:
print(url)
r = requests.get(url)
soup = BeautifulSoup(r.text ,"lxml")
#data = soup.select('name-head a')
#for link in data:
# href = 'https://www.drugbank.ca/drugs/' + link.get('href')
# pages_data(href)
# next page url
url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
print(url)
if url:
url = 'https://www.drugbank.ca' + url[0].get('href')
else:
break
drug_data()
The issue is that in each page, and for each drug in the table of this page I need to capture : Name. Accession Number. Structured Indications, Generic Prescription Products,
I used the classical request/beautifusoup but can't go deep ..
Some Help please
To crawl effectively, you'll want to implement a few measures, such as maintaining a queue of urls to visit and be aware of what urls you have already visited.
Keeping in mind that links can be absolute or relative and that redirects are very likely, you also probably want to construct the urls dynamically rather than string concatenation.
Here is a generic (we usually only want to use
example.com
on SO) crawling workflow...Create function with
requests
andBeautifulSoup
to get data from subpage