Though I'm not particularly advanced at any of this, I've had some past success using urllib2, requests and scrapy, but this has me stumped. So after much searching and banging my head against the keyboard, I'll just go ahead and ask.
I'd like to get the HTML source of a site, but after submitting my username and password I keep getting a page back which says my username and password are wrong. They work fine in the browser, and once logged in the source code is readily available (via the browser). But I can't seem to achieve the same result via Python/terminal. I'll include some of my attempts (gleaned from these helpful pages) below:
using urllib2:
import base64
from urllib2 import Request, urlopen
req = Request(website, headers={ 'User-Agent': 'Mozilla/5.0' })
# Build the HTTP Basic auth header by hand
base64string = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
req.add_header("Authorization", "Basic %s" % base64string)
readweb = urlopen(req).read()
another version:
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, theurl, username, password)
authhandler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(authhandler)
pagehandle = opener.open(theurl)
return pagehandle.read()
and an attempt using requests:
r = requests.session()
try:
    r.post(theurl, data={'username' : 'username', 'password' : 'password', 'remember':'1'})
except:
    print('Sorry, Unable to...')
result = r.get(theurl)
return result.text
I've also tried to use scrapy, but regardless of which library I use it comes back with the html of a page which says my password/details are wrong. I'm guessing it's something to do with the headers/authorisation(?) I'm sending, but I'm not overly sure. Any help much appreciated, please let me know what other details I can update with (I've been up half the night with this, so if this post doesn't make sense please forgive me!)
EDIT:
Here's the traceback response to Prashant's answer below (minus the passwords etc.):
Traceback (most recent call last):
  File "/Users/Hatsaw/newpy/pras.py", line 3, in <module>
    r = requests.get(URL, auth=('username','password'))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.9.0-py2.7.egg/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.9.0-py2.7.egg/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.9.0-py2.7.egg/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.9.0-py2.7.egg/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.9.0-py2.7.egg/requests/adapters.py", line 437, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='website', port=80): Max retries exceeded with url: /dashboard/ (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))
EDIT:
Ok, I'm now using mechanize (recommended below), and here's what I'm getting back (not sure if this is another instance of my root problem or my inability with mechanize!):
Traceback (most recent call last):
  File "/Users/Hatsaw/newpy/pras2.py", line 13, in <module>
    browser.form['email'] = 'email address'
  File "build/bdist.macosx-10.6-intel/egg/mechanize/_form.py", line 2780, in __setitem__
  File "build/bdist.macosx-10.6-intel/egg/mechanize/_form.py", line 3101, in find_control
  File "build/bdist.macosx-10.6-intel/egg/mechanize/_form.py", line 3185, in _find_control
mechanize._form.ControlNotFoundError: no control matching name 'email'
EDIT:
Still struggling with this, so here's a last ditch effort before time runs out on this project and I have to go in and get all the html manually! Fingers crossed..
Ok, so on the advice of barny, I'm back to using requests, and I'm attempting to provide the post with cookie information that I've gleaned from a successful browser login. I'm not certain I'm doing this correctly, but I'm using:
cookies = {'PHPSESSID':'5udcifi6p43ma3h1fnpfqghiu0'}
result = sess.get(the_url, cookies=cookies)
Now, at the moment, I'm getting an Internal Server Error response. After some research, ASP.NET forms seem to be the problem:
- Sending an ASP.net POST with Python's Requests
- Using Python Requests for ASP.NET authentication
I just want to check that I'm not doing something wrong with requests first, then perhaps I'll explore BeautifulSoup/robobrowser as recommended by Martijn Pieters in the SO link above.
Here's what the form section of the html is asking:
<form name="aspnetForm" method="post" action="" id="aspnetForm">
<div>
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
<input type="hidden" name="__VIEWSTATEFIELDCOUNT" id="__VIEWSTATEFIELDCOUNT" value="2" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTkwNzg1NTQ3OA9kFgJmD2QWAmYPZBYGAgetc." />
<input type="hidden" name="__VIEWSTATE1" id="__VIEWSTATE1" value="ZyBBIEhvbWUVIE5lZ290aWF0ZSBBZ3JlZW1lbnRzEiBSZetc." />
</div>
<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;
theForm.submit();
}
}
//]]>
</script>
<script src="/WebResource.axd?d=t2SAOwDGkbrEfkmUaMOR9sPLXqgxfeenNayRja3DNK2R8JEcH-StTTuiaqXpzp--PAISn3vzVbWQ7biREwPkibCmbAE1&t=635586505120000000" type="text/javascript"></script>
<script src="/ScriptResource.axd?d=EL6tXtJfNfGSoQwhYtVnYEqw4oKvuwBBI4etc." type="text/javascript"></script>
<script type="text/javascript">
//<![CDATA[
if (typeof(Sys) === 'undefined') throw new Error('ASP.NET Ajax client-side framework failed to load.');
//]]>
</script>
<script src="/ScriptResource.axd?d=qCmNMcECQa0tfmMcZdwJeeOdcyetc." type="text/javascript"></script>
<div>
<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="FC5C7135" />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEdABB2xJRvPLCcg6GsBqRFCtw6Xg91QEu10etc." />
</div>
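For reference, a minimal sketch of how the hidden inputs in a form like this could be gathered so they can be sent back together with the credentials. It uses only the standard library. `HiddenInputCollector` and `hidden_fields` are hypothetical names, the `sample` markup is a shortened copy of the form above, and whether this particular server accepts the resulting payload is untested.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> also end up here,
        # because the default handle_startendtag calls handle_starttag.
        if tag != 'input':
            return
        attrs = dict(attrs)
        if (attrs.get('type') or '').lower() == 'hidden' and attrs.get('name'):
            self.fields[attrs['name']] = attrs.get('value') or ''


def hidden_fields(html):
    """Return a dict of every hidden input found in the given HTML."""
    parser = HiddenInputCollector()
    parser.feed(html)
    return parser.fields


# Shortened copy of the form above, used as sample input.
sample = '''
<form name="aspnetForm" method="post" action="" id="aspnetForm">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="FC5C7135" />
<input type="text" name="ctl00$cphMain$tbUsername" />
</form>
'''

payload = hidden_fields(sample)
# Placeholder credentials, keyed by the field names found in the page.
payload['ctl00$cphMain$tbUsername'] = 'username'
payload['ctl00$cphMain$tbPassword'] = 'password'
```

The idea is that the hidden values come from a fresh GET of the login page each time, rather than being pasted in by hand, since they may change between page loads.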
So. Some small questions.
- Do the field names I post have to match the ones in the source code, i.e. should it be username, user, or something else? I've lost track of where I found this in the HTML, but I did find 'ctl00$cphMain$tbUsername' and 'ctl00$cphMain$tbPassword'…
- Do I need to send the password and/or username as a base64.encodestring? (I don't know if this is a problem, but the password contains chars such as !@$ etc.)
- Do I need to add ALL of the cookie fields I've found in the browser, or just the PHPSESSID? Here are the fields I've got in the cookies:
  ASP.NET_SessionId, CFID, CFTOKEN, __atuvc, __utma, __utmb, __utmc, __utmt, __utmz, BRO_CALLME, BRO_ID, BRO_LOGIN, BRO_MEMBER, BROAUTH, ISFULLMEMBER, phpMBLink, __CT_Data, WRUID
- There is the website (www.website.com), the login page (www.website.com/login), and then the content (www.website.com/content). Am I correct in thinking I use the cookie from the (successfully logged in) login page and 'send' it to the content page? Should I do this manually (enter field details from the browser cookie information) or within the code (so, in the code below I would use: cookies = r_login.cookies)?
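To make the base64 question concrete, here is a small sketch contrasting the two encodings involved. Base64 belongs to HTTP Basic auth, where it wraps "user:password" inside the Authorization header. A form login instead sends the plain field values percent-encoded in the request body, which requests does on its own when given `data=` as a dict. The password value below is made up for illustration.

```python
import base64

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

username = 'username'  # placeholder
password = 'p!@$word'  # made-up value containing special characters

# HTTP Basic auth: "user:password" is base64-encoded into a header.
token = base64.b64encode(('%s:%s' % (username, password)).encode('utf-8')).decode('ascii')
basic_header = {'Authorization': 'Basic ' + token}

# Form login: the value is sent as-is, only percent-encoded for transport.
form_body = urlencode({'ctl00$cphMain$tbPassword': password})
print(form_body)
```

So for a form post, special characters such as !@$ should not need any manual encoding before being handed to requests.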
Finally, here's the code I'm currently using, which returns an Internal Server Error:
import requests

the_url = 'the_url'
login = the_url + '/login'
content = the_url + '/content'
username = 'username'
password = 'password'

# One session for every request
sess = requests.Session()
sess.auth = (username, password)
sess.get(the_url)

# Field names as found in the login page's HTML
payload = {'ctl00$cphMain$tbUsername': username, 'ctl00$cphMain$tbPassword': password}
r_login = sess.post(login, data=payload)

# Cookie values copied by hand from the browser
cookies = {'PHPSESSID':'5udcifi6p43ma3h1fnpfqghiu0', 'ASP.NET_SessionId':'aspnet', 'BRO_LOGIN':'bro_login'}
r_data = sess.get(content, cookies=cookies, data=payload)
print r_data.text
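On the cookie questions, a `requests.Session` stores the cookies set by earlier responses and sends them with later requests to the same site, so in principle nothing has to be copied over from the browser. Below is a hedged sketch of that flow. `login_and_fetch` and its parameters are hypothetical names, `extract_fields` stands for any callable that turns the login page's HTML into a dict of hidden form fields, and none of this has been run against the site in question.

```python
def login_and_fetch(session, login_url, content_url, credentials, extract_fields=None):
    """Log in through a form, then fetch a page with the same session.

    `session` is expected to behave like requests.Session: cookies set by
    one response are sent again on the following requests.
    """
    # Load the login page first so the server can set its initial cookies.
    login_page = session.get(login_url)

    # Start from the page's hidden fields (if an extractor is given),
    # then add the username/password fields on top.
    payload = dict(extract_fields(login_page.text)) if extract_fields else {}
    payload.update(credentials)

    r_login = session.post(login_url, data=payload)
    r_login.raise_for_status()

    # No cookies argument: the session supplies them.
    r_data = session.get(content_url)
    r_data.raise_for_status()
    return r_data.text
```

With requests installed, it might be called as `login_and_fetch(requests.Session(), login, content, payload)`, using the names from the code above.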
Apologies, this has gotten rather long now. If I need to split it up over several posts, please let me know. What I assumed was a simple question at the outset has mutated into something else!