I am using urllib2 with CookieJar / HTTPCookieProcessor in an attempt to simulate a login to a page so that I can automate an upload.
I've seen some questions and answers on this, but nothing that solves my problem. I lose my cookie when I simulate the login, which ends in a 302 redirect. The 302 response is where the server sets the cookie, but urllib2's HTTPCookieProcessor does not seem to save the cookie during a redirect. I tried creating an HTTPRedirectHandler subclass to ignore the redirect, but that didn't do the trick. I also tried referencing the CookieJar globally to handle the cookies from the HTTPRedirectHandler, but (1) this didn't work, because I was handling the headers from the redirect handler and the CookieJar method I was using, extract_cookies, needed a full request, and (2) it's an ugly way to handle it.
I probably need some guidance on this as I'm fairly green with Python. I think I'm mostly barking up the right tree here, but maybe focusing on the wrong branch.
import cookielib
import urllib2

cj = cookielib.CookieJar()

class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        global cj
        cookie = headers.get("set-cookie")
        if cookie:
            # Doesn't work, but you get the idea
            cj.extract_cookies(headers, req)
        return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)

    # Reuse the same handler for the other redirect codes
    http_error_301 = http_error_303 = http_error_307 = http_error_302

cookieprocessor = urllib2.HTTPCookieProcessor(cj)
# Oh yeah. I'm using a proxy too, to follow traffic.
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8888'})
opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor, proxy)
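On point 1 above: extract_cookies(response, request) only asks the response for an info() method that returns the headers, so bare headers can be wrapped in a small object. Here is a minimal sketch, assuming Python 3, where cookielib and urllib2 became http.cookiejar and urllib.request. HeaderOnlyResponse, store_cookies, the URL and the cookie value are all made up for illustration. In the handler above, fp is the response object, so it may be what extract_cookies wants in place of headers.

```python
# Python 3 sketch: http.cookiejar / urllib.request replace cookielib / urllib2.
import email.message
import http.cookiejar
import urllib.request


class HeaderOnlyResponse:
    """Hypothetical wrapper: the minimum a CookieJar asks of a response."""

    def __init__(self, headers):
        self._headers = headers

    def info(self):
        # CookieJar.extract_cookies() reads Set-Cookie via response.info()
        return self._headers


def store_cookies(jar, url, set_cookie_values):
    """Feed raw Set-Cookie header values received for `url` into `jar`."""
    headers = email.message.Message()
    for value in set_cookie_values:
        headers["Set-Cookie"] = value  # Message appends repeated headers
    jar.extract_cookies(HeaderOnlyResponse(headers), urllib.request.Request(url))
    return jar


jar = store_cookies(
    http.cookiejar.CookieJar(),
    "http://example.com/login",  # made-up URL
    ["session=abc123; Path=/"],  # made-up cookie
)
print([(c.name, c.value) for c in jar])
```

The request object matters because the jar uses its host and path to decide whether the cookie is acceptable and where it applies.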
Addition: I had tried using mechanize as well, without success. This is probably a new question, but I'll pose it here since it is the same ultimate goal:
This simple mechanize code fails when used with a URL that emits a 302 (http://fxfeeds.mozilla.com/firefox/headlines.xml). Note that the same behavior occurs without set_handle_robots(False); I just wanted to rule that out:
import urllib2, mechanize
browser = mechanize.Browser()
browser.set_handle_robots(False)
opener = mechanize.build_opener(*(browser.handlers))
r = opener.open("http://fxfeeds.mozilla.com/firefox/headlines.xml")
Output:
Traceback (most recent call last):
  File "redirecttester.py", line 6, in <module>
    r = opener.open("http://fxfeeds.mozilla.com/firefox/headlines.xml")
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_opener.py", line 204, in open
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_urllib2_fork.py", line 457, in http_response
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_opener.py", line 221, in error
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_urllib2_fork.py", line 332, in _call_chain
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_urllib2_fork.py", line 571, in http_error_302
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_opener.py", line 188, in open
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_mechanize.py", line 71, in http_request
AttributeError: OpenerDirector instance has no attribute '_add_referer_header'
Any ideas?
I was also having the same problem where the server would respond to the login POST request with a 302 and the session token in the Set-Cookie header. Using Wireshark it was clearly visible that urllib was following the redirect but not including the session token in the Cookie.
I literally just ripped out urllib and did a direct replacement with requests and it worked perfectly first time without having to change a thing. Big props to those guys.
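If you want a reproducible baseline for the "Set-Cookie on a 302" case before swapping libraries, the scenario can be recreated against a throwaway local server. This is a minimal sketch, assuming Python 3 (urllib.request and http.cookiejar rather than urllib2 and cookielib). The server, the /login and /home paths, the form data and the session cookie are all invented for the example. It says nothing about why a particular urllib2 setup lost the cookie; it only shows what the exchange looks like when the jar does its job.

```python
# Python 3 sketch; every name, path and value below is made up for illustration.
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical server: POST answers 302 + Set-Cookie, GET echoes Cookie."""

    def do_POST(self):
        # Drain the request body before answering.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(302)
        self.send_header("Location", "/home")
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # Echo back whatever Cookie header the client sent.
        body = self.headers.get("Cookie", "").encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def login_and_follow_redirect():
    """Return (cookie names in the jar, Cookie header seen after the redirect)."""
    server = HTTPServer(("127.0.0.1", 0), LoginHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),  # ignore any system proxy settings
            urllib.request.HTTPCookieProcessor(jar),
        )
        url = "http://127.0.0.1:%d/login" % server.server_port
        response = opener.open(url, data=b"user=me&password=secret")
        echoed = response.read().decode()
        return [c.name for c in jar], echoed
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(login_and_follow_redirect())
```

Running the same opener through a proxy or Wireshark, as described above, should then show whether the Cookie header actually goes out on the redirected request.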
I have been having the exact same problem recently but in the interest of time scrapped it and decided to go with mechanize. It can be used as a total replacement for urllib2 that behaves exactly as you would expect a browser to behave with regards to Referer headers, redirects, and cookies.
The Browser object can be used as an opener itself (using the .open() method). It maintains state internally but also returns a response object on every call. So you get a lot of flexibility.
Also, if you don't have a need to inspect the cookiejar manually or pass it along to something else, you can omit the explicit creation and assignment of that object as well.
I am fully aware this doesn't address what is really going on and why urllib2 can't provide this solution out of the box or at least without a lot of tweaking, but if you're short on time and just want it to work, just use mechanize.
I've just got a variation of the below working for me, at least when trying to read Atom from http://www.fudzilla.com/home?format=feed&type=atom
I can't verify that the below snippet will run as-is, but might give you a start:
Depends on how the redirect is done. If it's done via an HTTP Refresh, then mechanize has an HTTPRefreshProcessor you can use. Try to create an opener like this:
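For context on the distinction: a Refresh redirect is not a 3xx response at all. The server answers normally and sends a header such as "Refresh: 0; url=/home", which a plain redirect handler never sees. Here is a stdlib-only sketch of pulling the delay and target out of such a header value. parse_refresh is a hypothetical helper written for this illustration, not part of mechanize or urllib, and real-world Refresh values can be messier than this pattern allows.

```python
import re

# Matches "<seconds>" optionally followed by ";" or "," and a "url=" target.
_REFRESH_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:[;,]\s*(?:url\s*=\s*)?(.*?))?\s*$",
    re.IGNORECASE,
)


def parse_refresh(value):
    """Split a Refresh header value into (delay_seconds, target_or_None).

    Returns None when the value does not look like a Refresh header.
    """
    match = _REFRESH_RE.match(value)
    if match is None:
        return None
    delay = float(match.group(1))
    target = match.group(2)
    if target:
        target = target.strip("'\"")  # some servers quote the URL
    return delay, (target or None)


print(parse_refresh("0; url=/home"))
```

A target of None means "reload the current URL after the delay", so a caller would fall back to the URL it just requested.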