Recovering from HTTPError in Mechanize

Posted 2020-06-03 00:30

Question:

I am writing a function for some existing python code that will be passed a Mechanize browser object as a parameter.

I fill in some details in a form in the browser, and use response = browser.submit() to move the browser to a new page, and collect some information from it.

Unfortunately, I occasionally get the following error:

httperror_seek_wrapper: HTTP Error 500: Internal Server Error

 

I've navigated to the page in my own browser, and sure enough, I occasionally see this error directly, so I think this is a server problem, not anything to do with robots.txt, headers or similar.

The problem is that after submitting, the state of the browser object changes and I can't continue to use it. My first thought was to take a deep copy before submitting and fall back to it if I ran into problems, but that gives the error TypeError: object.__new__(cStringIO.StringO) is not safe, use cStringIO.StringO.__new__() as described here.

I've also tried using browser.back() but get NoneType errors.

Does anyone have a good solution to this?

 

Solution (with thanks to karnesJ.R below):

A great solution below uses the excellent requests library (docs here). requests has functionality to fill in a form and submit via post or get, which importantly doesn't change the state of the br object.

An excellent website allows us to test various error codes, and has a form interface at the top that I've tested this on. I create a br object at this site, then define a function that selects the form from br, pulls out the relevant information, but does the submit via requests - so that the br object hasn't changed and is re-usable. Error codes cause requests to return rubbish, but don't render the br unusable.

As stated below, this involves a little more setup time, but is well worth it.

import mechanize
import requests

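def testErrorCodes(br,theCodes):
    for x in theCodes:
        # Select the form only to read its action URL; br itself is never submitted
        br.select_form(nr=0)

        theAction = br.action
        payload = {'code': x}

        # Submit through requests so the state of br is left untouched
        response = requests.post(theAction, data=payload)
        print(response.status_code)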

br=mechanize.Browser()
br.set_handle_robots(False)
response = br.open("http://savanttools.com/test-http-status-codes")

testErrorCodes(br,[401,402,403,404,500,503,504]) # Prints the error codes 

testErrorCodes(br,[404]) # The browser is still alive and well to be used again!

Answer 1:

It's been a while since I've written Python, but I think I have a workaround for your problem. Try this method:

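import mechanize
import requests

accepted_code = 200  # assumption: the status code that counts as success

try:
    response = browser.submit()  # browser is the mechanize.Browser from the question
except mechanize.HTTPError:
    while True: ## DANGER: never exits if the server keeps failing ##
        ## You will need to format and/or decode the POST for your form
        response = requests.post('http://yourwebsite.com/formlink', data=None, json=None)
        ## If the server will accept JSON formatting, this becomes trivial
        if response.status_code == accepted_code: break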

You can find documentation about the requests library here. I personally think that requests is better for your case than mechanize... but it does require a little more work from you, in that you need to break the form submission down into the raw POST request it sends, for example by inspecting that request with some kind of HTTP interceptor in your browser.

Ultimately though, by passing in br you are restricting yourself to the way that mechanize handles browser states on br.submit().
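
The while True loop above never gives up if the server keeps returning errors. A minimal sketch of a bounded version might look like the following. The names post_with_retries, accepted_codes, max_attempts and delay_seconds are hypothetical and not part of requests or mechanize; the post argument is assumed to be a callable with the same shape as requests.post.

```python
import time


def post_with_retries(post, url, payload, accepted_codes=(200,),
                      max_attempts=5, delay_seconds=1.0):
    """Call post(url, data=payload) until the status code is accepted.

    Returns the first response whose status_code is in accepted_codes,
    or None if every attempt came back with some other code.
    """
    for attempt in range(max_attempts):
        response = post(url, data=payload)
        if response.status_code in accepted_codes:
            return response
        # Wait before retrying, but not after the final attempt.
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None
```

With requests installed, a call could look like post_with_retries(requests.post, theAction, payload), where theAction and payload are built from the form as in the question's solution. Returning None rather than the failed response is just one design choice; the caller then has to handle the case where the server never recovered.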



Answer 2:

I'm assuming that you want the submission to happen even if it takes multiple tries.

The solution that I thought of is certainly not efficient, but it should work.

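import mechanize

def do_something_in_mechanize():
    # ...insert your code here...
    try:
        browser.submit()  # browser is the mechanize.Browser from the question
        # ...rest of your code...
    except mechanize.HTTPError:
        # Retry from the top; this recurses once per failure
        do_something_in_mechanize()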

Basically, it'll call the function until the action is performed without HTTPErrors.
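
Because each failed attempt recurses, a server that stays broken will eventually exhaust Python's recursion limit. One possible alternative is a loop with a fixed number of attempts. In the sketch below, retry_on_error and its parameters are hypothetical names; action stands for whatever function sets up the browser and submits the form, and error_type would be mechanize.HTTPError in the situation described here.

```python
def retry_on_error(action, error_type, max_attempts=3):
    """Run action() until it succeeds or max_attempts is reached.

    Returns the result of action(); re-raises the last error if
    every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except error_type:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
```

A call might look like retry_on_error(do_something_in_mechanize, mechanize.HTTPError), with the except clause removed from do_something_in_mechanize so the error reaches the loop. Since the question reports that the browser is unusable after a failed submit, the action may also need to reopen the page before each attempt.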