I need to get links to genetic pathways from a website. First I need to log in, but I am having trouble with that step. I have very little experience with scraping, so any pointers or general 'how to' information would be very much appreciated, along with an exact answer.
    import requests
    from bs4 import BeautifulSoup

    URL = 'http://www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP:BIOCARTA'

    # Post my email to the page, hoping this logs me in
    session1 = requests.Session()
    params = {'login':'my_email'}
    session2 = session1.post(URL, data=params)

    # Parse the response HTML and walk down to the gene set table
    soup = BeautifulSoup(session2.text, 'html.parser')
    pathways_links = []
    for link in soup.find('div', attrs={'id':'wrapper'}).find(
            'div', attrs={'id':'contentwrapper'}).find(
            'div', attrs={'id':'content_navs'}).find(
            'table', attrs={'id':'geneSetTable'}).find_all('a', href=True):
        pathways_links.append(link['href'])
        print(link['href'])
Unfortunately, it doesn't seem to log me in. I get:
        'div', attrs={'id':'content_navs'}).find(
    AttributeError: 'NoneType' object has no attribute 'find'
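As far as I can tell, the AttributeError only means that one of the find() calls returned None (find() returns None when nothing matches), so the next .find() ends up being called on None. A small helper along these lines would at least report which step failed. This is only a sketch: `find_chain` is my own hypothetical name, not part of BeautifulSoup, and it assumes the object passed in has a BeautifulSoup-style find(name, attrs=...) method.

```python
def find_chain(root, steps):
    """Follow a chain of find() calls, failing clearly at the first miss.

    root  -- any object with a BeautifulSoup-style find(name, attrs=...) method
    steps -- list of (tag_name, attrs_dict) pairs, outermost element first
    """
    node = root
    for name, attrs in steps:
        found = node.find(name, attrs=attrs)
        if found is None:
            # find() returns None on a miss; calling .find on that None is what
            # produces "'NoneType' object has no attribute 'find'".
            raise LookupError("no <%s> matching %r" % (name, attrs))
        node = found
    return node
```

With the soup from my code this would be called as find_chain(soup, [('div', {'id': 'wrapper'}), ('div', {'id': 'contentwrapper'}), ('div', {'id': 'content_navs'}), ('table', {'id': 'geneSetTable'})]).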
If I print what the chain returns just before the 'content_navs' div, I get:
    <div id="content_full">
    <h1>Login to GSEA/MSigDB</h1>
    <h2>Login</h2>
    <a href="register.jsp"></a>Click here</div>
Any solutions would be much appreciated. Thanks.
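One thing I suspect is that the credentials have to be posted to the login form's own target rather than to the gene set page, and that the same session then has to be used to request the page again. Below is a rough sketch of that flow, not a confirmed solution. `login_url` and `form_fields` are placeholders that would have to be read off the login form's HTML (I don't know the real values), `fetch_logged_in` and `is_login_page` are hypothetical names of mine, and the login check relies only on the heading shown in the output above.

```python
def is_login_page(html):
    """Heuristic: the page I get back when not logged in has this heading."""
    return "Login to GSEA/MSigDB" in html


def fetch_logged_in(session, page_url, login_url, form_fields):
    """Return the HTML of page_url, logging in first if needed.

    session     -- a requests.Session (anything with get/post whose
                   responses have a .text attribute will do)
    login_url   -- ASSUMPTION: the URL from the login form's action attribute
    form_fields -- ASSUMPTION: dict mapping the form's input names to values
    """
    html = session.get(page_url).text
    if is_login_page(html):
        # Post to the form's target, not to the page being scraped; the
        # session keeps whatever cookies the server sets in response.
        session.post(login_url, data=form_fields)
        html = session.get(page_url).text
    if is_login_page(html):
        raise RuntimeError("still on the login page; check login_url and form_fields")
    return html
```

With requests this would be called as fetch_logged_in(requests.Session(), URL, login_url, form_fields), and the returned HTML could then be passed to BeautifulSoup as before.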