Log in to a website for scraping with Python

Published 2020-08-04 09:36

Question:

I need to get links to genetic pathways from a website. First I need to log in, but I am having trouble. I have very little experience with scraping, so any pointers or general how-to information would be much appreciated, along with an exact answer.

import requests
from bs4 import BeautifulSoup
URL = 'http://www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP:BIOCARTA'
session1 = requests.Session()
params = {'login':'my_email'}
session2 = session1.post(URL, data=params)

pathways_links = []

for link in soup.find('div', attrs={'id':'wrapper'}).find(
    'div', attrs={'id':'contentwrapper'}).find(
        'div', attrs={'id':'content_navs'}).find(
            'table', attrs={'id':'geneSetTable'}).find('a')['href']:
    pathways_links.append(link)
    print link

Unfortunately, it doesn't seem to log me in. I get:

'div', attrs={'id':'content_navs'}).find(
 AttributeError: 'NoneType' object has no attribute 'find'

If I ask it to print the contents before the 'content_navs' div, I get:

<div id="content_full">
<h1>Login to GSEA/MSigDB</h1>
<h2>Login</h2>
<a href="register.jsp"></a>Click here</div>

Any solutions would be much appreciated. Thanks.

Answer 1:

You need to log in first at http://www.broadinstitute.org/gsea/login.jsp and then go to the other location.

The first step is to create a session object, which will persist cookies and other session details. Next, log in with that session, then request the page you want and pass its contents to BeautifulSoup:

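import requests
from bs4 import BeautifulSoup

# A Session keeps the cookies set at login and sends them with later requests
s = requests.Session()
data = {'j_username': 'you@email.com'}
s.post('http://www.broadinstitute.org/gsea/login.jsp', data=data)

# Fetch the protected page with the same session, then parse it
r = s.get('http://www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP:BIOCARTA')
soup = BeautifulSoup(r.content, 'html.parser')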

# the rest of your code
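
For "the rest of your code", note that the loop in the question would not collect links even after a successful login: .find('a')['href'] returns a single string, so the for loop walks over its characters. Below is a minimal sketch of one way to gather every link inside the geneSetTable table and to detect when the login page came back instead. It uses the standard library's html.parser rather than BeautifulSoup so that it runs with no extra packages. The names GeneSetLinkParser and extract_pathway_links are hypothetical, and it assumes the login page can be recognised by an h1 heading containing "Login", as in the output shown in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class GeneSetLinkParser(HTMLParser):
    """Collect hrefs of <a> tags found inside <table id="geneSetTable">."""

    def __init__(self, table_id="geneSetTable"):
        super().__init__()
        self.table_id = table_id
        self.depth = 0                  # > 0 while inside the target table
        self.links = []
        self.saw_login_heading = False  # set if an <h1> mentions "login"
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            # Enter the target table, or track a table nested inside it
            if self.depth or attrs.get("id") == self.table_id:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1 and "login" in data.lower():
            self.saw_login_heading = True


def extract_pathway_links(html, base_url):
    """Return absolute URLs for all links in the gene set table."""
    parser = GeneSetLinkParser()
    parser.feed(html)
    if parser.saw_login_heading:
        # Assumption: this heading only appears when the session is not logged in
        raise RuntimeError("Got the login page; the session is not authenticated")
    return [urljoin(base_url, href) for href in parser.links]
```

With the session from the answer, this could be called as extract_pathway_links(r.text, r.url). If BeautifulSoup is preferred, the same idea is soup.find('table', id='geneSetTable').find_all('a', href=True), after checking that the find() call did not return None.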