Question:
I'm trying to log in to my university's server via Python, but I'm entirely unsure how to go about generating the appropriate HTTP POSTs, creating the keys and certificates, and the other parts of the process I may be unfamiliar with that are required to comply with the SAML spec. I can log in with my browser just fine, but I'd like to be able to log in and access other content on the server using Python.
For reference, here is the site
I've tried logging in with mechanize (selecting the form, populating the fields, clicking the submit button control via mechanize.Browser.submit(), etc.) to no avail; the login page gets spat back each time.
At this point, I'm open to implementing a solution in whichever language is most suitable to the task. Basically, I want to programmatically log in to a SAML-authenticated server.
Answer 1:
Basically, what you have to understand is the workflow behind a SAML authentication process. Unfortunately, there is little documentation out there that really helps in figuring out what the browser does when it accesses a SAML-protected website.
Maybe you should take a look at something like this: http://www.docstoc.com/docs/33849977/Workflow-to-Use-Shibboleth-Authentication-to-Sign
and obviously at this: http://en.wikipedia.org/wiki/Security_Assertion_Markup_Language. In particular, focus your attention on this scheme:
What I did when I was trying to understand how SAML works, since the documentation was so poor, was write down (yes, writing - on paper!) every step the browser performed, from the first to the last. I used Opera, configured not to follow automatic redirects (300, 301, 302 response codes, and so on) and with JavaScript disabled.
Then I wrote down all the cookies the server was sending me, what was doing what, and why.
Maybe it was too much effort, but that way I was able to write a library, in Java, which is suited for the job, and incredibly fast and efficient too. Maybe someday I will release it publicly...
What you should understand is that, in a SAML login, there are two actors at play: the IdP (identity provider) and the SP (service provider).
A. FIRST STEP: the user agent requests the resource from the SP
I'm quite sure you reached the link you reference in your question from another page, by clicking something like "Access to the protected website". If you pay closer attention, you'll notice that the link you followed is not the one on which the authentication form is displayed. That's because following that link from the SP to the IdP is itself a SAML step. The first step, actually.
It allows the IdP to determine who you are and why you are trying to access its resource.
So, basically, what you'll need to do is make a request to the link you followed in order to reach the web form, and capture the cookies it sets. What you won't see is the SAMLRequest string, encoded into the 302 redirect behind that link, which is sent to the IdP when the connection is made.
I think that's the reason why you can't mechanize the whole process: you simply connected to the form, with no identity identification done!
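This first leg can be sketched in Python. The URLs below are hypothetical placeholders; the point is that with redirect-following disabled, the SAMLRequest shows up as a query parameter on the 302 Location header, which a small stdlib helper can extract:

```python
import urllib.parse

def saml_request_from_location(location):
    """Extract the SAMLRequest parameter from a 302 Location URL, if present."""
    query = urllib.parse.urlparse(location).query
    params = urllib.parse.parse_qs(query)
    values = params.get("SAMLRequest")
    return values[0] if values else None

# Sketch of the first hop with the requests library (hypothetical URL),
# redirects disabled so the SAMLRequest in the Location header is visible:
#
#   import requests
#   session = requests.Session()
#   resp = session.get("https://sp.example.edu/protected", allow_redirects=False)
#   print(saml_request_from_location(resp.headers["Location"]))
```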
B. SECOND STEP: filling in the form and submitting it
This one is easy. But be careful! The cookies that are set now are not the same as the cookies above. You're now connecting to a completely different website. That's the reason why SAML is used: different website, same credentials.
So you may want to store these authentication cookies, provided by a successful login, in a different variable.
The IdP is now going to send you back a response (to the SAMLRequest): the SAMLResponse. You have to extract it from the source code of the page the login lands on. In fact, that page is one big form containing the response, with some JavaScript that automatically submits it when the page loads. You have to get the page source, strip away the HTML around it, and extract the SAMLResponse (base64-encoded, and possibly signed or encrypted).
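Since the IdP's reply is just an HTML form that JavaScript would normally auto-submit, the SAMLResponse can be pulled out with a small stdlib parser. This is a sketch; real pages may warrant a more robust parser such as BeautifulSoup:

```python
from html.parser import HTMLParser

class SAMLFormParser(HTMLParser):
    """Collects the auto-submit form's action URL and its named input fields."""
    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = a.get("action")
        elif tag == "input" and "name" in a:
            self.fields[a["name"]] = a.get("value", "")

def parse_saml_form(html):
    """Return (action_url, fields) for the IdP's response form."""
    p = SAMLFormParser()
    p.feed(html)
    return p.action, p.fields
```

The returned fields (typically SAMLResponse, often alongside RelayState) are exactly the payload you POST back in the next step, and the action URL is where to send it.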
C. THIRD STEP: sending the response back to the SP
Now you're ready to finish the procedure. You have to send (via POST, since you're emulating a form) the SAMLResponse obtained in the previous step to the SP. The SP will then provide the cookies needed to access the protected content you're after.
Aaaaand, you're done!
Again, I think the most valuable thing you can do is use Opera and analyze ALL the redirects SAML performs, then replicate them in your code. It's not that difficult; just keep in mind that the IdP is entirely distinct from the SP.
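The third step is just a form POST of those fields to the SP's assertion-consumer endpoint. A minimal sketch of assembling that POST (hypothetical endpoint; which extra fields such as RelayState the SP expects depends on the deployment):

```python
import urllib.parse

def build_acs_post(action_url, fields):
    """Prepare the form POST that relays the SAMLResponse (and any RelayState)
    to the SP's assertion-consumer endpoint."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return action_url, body

# With the requests library (hypothetical URL), the same step is one call;
# the session's cookie jar then holds the SP session cookie granting access:
#
#   resp = session.post(action_url, data=fields)
```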
Answer 2:
Selenium with the headless PhantomJS WebKit browser will be your best bet for logging into Shibboleth, because it handles cookies and even JavaScript for you.
Installation:
$ pip install selenium
$ brew install phantomjs
from selenium import webdriver
from selenium.webdriver.support.ui import Select # for <SELECT> HTML form
driver = webdriver.PhantomJS()
# On Windows, use a raw string so the backslashes aren't treated as escapes:
# webdriver.PhantomJS(r'C:\phantomjs-1.9.7-windows\phantomjs.exe')
# Service selection
# Here I had to select my school among others
driver.get("http://ent.unr-runn.fr/uPortal/")
select = Select(driver.find_element_by_name('user_idp'))
select.select_by_visible_text('ENSICAEN')
driver.find_element_by_id('IdPList').submit()
# Login page (https://cas.ensicaen.fr/cas/login?service=https%3A%2F%2Fshibboleth.ensicaen.fr%2Fidp%2FAuthn%2FRemoteUser)
# Fill the login form and submit it
driver.find_element_by_id('username').send_keys("myusername")
driver.find_element_by_id('password').send_keys("mypassword")
driver.find_element_by_id('fm1').submit()
# Now connected to the home page
# Click on 3 links in order to reach the page I want to scrape
driver.find_element_by_id('tabLink_u1240l1s214').click()
driver.find_element_by_id('formMenu:linknotes1').click()
driver.find_element_by_id('_id137Pluto_108_u1240l1n228_50520_:tabledip:0:_id158Pluto_108_u1240l1n228_50520_').click()
# Select and print an interesting element by its ID
page = driver.find_element_by_id('_id111Pluto_108_u1240l1n228_50520_:tableel:tbody_element')
print(page.text)
Note:
- during development, use Firefox to preview what you are doing
driver = webdriver.Firefox()
- this script is provided as-is and with the corresponding links, so you can compare each line of code with the actual source code of the pages (until login at least).
Answer 3:
Extending the answer from Stéphane Bruckert above, once you have used Selenium to get the auth cookies, you can still switch to requests if you want to:
import requests
cook = {i['name']: i['value'] for i in driver.get_cookies()}
driver.quit()
r = requests.get("https://protected.ac.uk", cookies=cook)
Answer 4:
You can find here a more detailed description of the Shibboleth authentication process.
Answer 5:
I wrote a simple Python script capable of logging into a Shibbolized page.
First, I used Live HTTP Headers in Firefox to watch the redirects for the particular Shibbolized page I was targeting.
Then I wrote a simple script using urllib.request (in Python 3.4; urllib2 in Python 2.x seems to have the same functionality). I found that the default redirect-following of urllib.request worked for my purposes, but it was useful to subclass urllib.request.HTTPRedirectHandler (class ShibRedirectHandler below) and add a handler for all http_error_302 events.
In this subclass I just printed out the parameter values (for debugging purposes); note that in order to keep the default redirect-following, you need to end the handler with return HTTPRedirectHandler.http_error_302(self, ...) (i.e. a call to the base class http_error_302 handler).
The most important component for making urllib work with Shibbolized authentication is creating an OpenerDirector that has cookie handling added. You build the OpenerDirector with the following:
cookieprocessor = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(ShibRedirectHandler, cookieprocessor)
response = opener.open("https://shib.page.org")
Here is a full script that may get you started (you will need to change a few mock URLs I provided and also enter a valid username and password). This uses Python 3 classes; to make it work in Python 2, replace urllib.request with urllib2 and urllib.parse with urlparse:
import urllib.request
import urllib.parse

# Subclass of HTTPRedirectHandler. Does not do much, but is very
# verbose: prints out all the redirects. Compare with what you see
# from your browser's redirects (using Live HTTP Headers or similar).
class ShibRedirectHandler(urllib.request.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        print(req)
        print(fp.geturl())
        print(code)
        print(msg)
        print(headers)
        # Without this return (passing the parameters on to the base class),
        # redirect following will not happen automatically for you.
        return urllib.request.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)

cookieprocessor = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(ShibRedirectHandler, cookieprocessor)

# Edit: should be the URL of the site/page you want to load that is protected with Shibboleth
opener.open("https://shibbolized.site.example").read()

# Inspect the page source of the Shibboleth login form; find the input names for the username
# and password, and edit the dictionary keys here to match those input names
loginData = urllib.parse.urlencode({'username': '<your-username>', 'password': '<your-password>'})
bLoginData = loginData.encode('ascii')

# By looking at the source of your Shib login form, find the URL the form action posts back to,
# and hard-code that URL in place of the mock URL below.
# Make sure you include the scheme, host, port number and path.
response = opener.open("https://test-idp.server.example", bLoginData)

# See what you got.
print(response.read())
Answer 6:
Mechanize can do the job as well, except that it doesn't handle JavaScript. Authentication worked successfully, but once on the home page I couldn't load this link:
<a href="#" id="formMenu:linknotes1"
onclick="return oamSubmitForm('formMenu','formMenu:linknotes1');">
In case you need JavaScript, better to use Selenium with PhantomJS. Otherwise, I hope you find inspiration in this script:
#!/usr/bin/env python
#coding: utf8
import sys, logging
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text
br = mechanize.Browser() # Browser
cj = cookielib.LWPCookieJar() # Cookie Jar
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but doesn't hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# User-Agent
br.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36')]
br.open('https://ent.unr-runn.fr/uPortal/')
br.select_form(nr=0)
br.submit()
br.select_form(nr=0)
br.form['username'] = 'myusername'
br.form['password'] = 'mypassword'
br.submit()
br.select_form(nr=0)
br.submit()
rs = br.open('https://ent.unr-runn.fr/uPortal/f/u1240l1s214/p/esup-mondossierweb.u1240l1n228/max/render.uP?pP_org.apache.myfaces.portlet.MyFacesGenericPortlet.VIEW_ID=%2Fstylesheets%2Fetu%2Fdetailnotes.xhtml')
# Eventually comparing the cookies with those on Live HTTP Header:
print "Cookies:"
for cookie in cj:
    print cookie
# Displaying page information
print rs.read()
print rs.geturl()
print rs.info()
# And that last line didn't work
rs = br.follow_link(id="formMenu:linknotes1", nr=0)
Answer 7:
I faced a similar problem with my university page SAML authentication as well.
The basic idea is to use a requests.Session object to automatically handle most of the HTTP redirects and cookie storing. However, there were many redirects driven by JavaScript as well, and this caused multiple problems with the simple requests solution.
I ended up using Fiddler to keep track of every request my browser made to the university server, to fill in the redirects I had missed. It really made the process easier.
My solution is far from ideal, but seems to work.
Answer 8:
Though this is already answered, hopefully it helps someone. I had the task of downloading files from a SAML website and got help from Stéphane Bruckert's answer.
If headless mode is used, wait times need to be specified at the points where the login redirects happen. Once the browser was logged in, I took its cookies and used them with the requests module to download the file - got help from this.
This is how my code looks:
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options  # imports

things_to_download = [a, b, c, d, e, f]  # the values changing in the URL

options = Options()
options.headless = False
driver = webdriver.Chrome('D:/chromedriver.exe', options=options)
driver.get('https://website.to.downloadfrom.com/')
driver.find_element_by_id('username').send_keys("Your_username")  # the IDs will differ for other websites/forms
driver.find_element_by_id('password').send_keys("Your_password")
driver.find_element_by_id('logOnForm').submit()

# Copy the browser's cookies into a requests session, once
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

for thing in things_to_download:
    response = session.get('https://website.to.downloadfrom.com/bla/blabla/' + str(thing))
    with open('Downloaded_stuff/' + str(thing) + '.pdf', 'wb') as f:
        f.write(response.content)  # saving the file

driver.close()
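The waits mentioned above are what selenium's WebDriverWait provides. As a sketch of the same idea without a browser dependency, the underlying polling loop looks like this (wait_until is a hypothetical helper, not part of selenium):

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` (a zero-argument callable) until it returns a truthy
    value or `timeout` seconds elapse. Returns the value, or raises
    TimeoutError. This mirrors what WebDriverWait does during login redirects."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1fs" % timeout)
        time.sleep(poll)

# With selenium itself, the equivalent explicit wait is:
#
#   from selenium.webdriver.support.ui import WebDriverWait
#   from selenium.webdriver.support import expected_conditions as EC
#   from selenium.webdriver.common.by import By
#   WebDriverWait(driver, 10).until(
#       EC.presence_of_element_located((By.ID, "logOnForm")))
```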