Python urllib2 automatic form filling and retrieval

Posted 2019-02-03 13:14

Question:

I want to query a site for warranty information about the machine this script runs on. The script should be able to fill out a form if needed (as on HP's service site, for example) and then retrieve the resulting web page.

I already have the code in place to parse the HTML that comes back. I'm just having trouble with how to POST the data that goes into the form fields and then retrieve the resulting page.

Answer 1:

If you absolutely need to use urllib2, the basic gist is this:

import urllib
import urllib2
url = 'http://whatever.foo/form.html'
form_data = {'field1': 'value1', 'field2': 'value2'}
params = urllib.urlencode(form_data)     # URL-encode the fields into the request body
response = urllib2.urlopen(url, params)  # passing data makes this a POST
data = response.read()                   # HTML of the resulting page

If you send along POST data (the 2nd argument to urlopen()), the request method is automatically set to POST.
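For anyone on Python 3, where urllib2 and urllib.urlencode moved into urllib.request and urllib.parse, a minimal sketch of the same idea might look like this. The helper names build_form_request and post_form are made up for illustration, and the URL and field names are the placeholders from above. Nothing is sent until post_form is called, so the last lines only inspect the request.

```python
# Python 3 sketch: urllib2 -> urllib.request, urllib.urlencode -> urllib.parse.urlencode
import urllib.parse
import urllib.request


def build_form_request(url, form_data):
    """Return a Request whose body is the URL-encoded form data (hypothetical helper)."""
    body = urllib.parse.urlencode(form_data).encode("ascii")
    # Supplying data is what turns the request into a POST.
    return urllib.request.Request(url, data=body)


def post_form(url, form_data, timeout=10):
    """Send the form and return the response body as bytes (hypothetical helper)."""
    request = build_form_request(url, form_data)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Placeholder URL and fields; this only builds the request, it does not send it.
req = build_form_request(
    "http://whatever.foo/form.html",
    {"field1": "value1", "field2": "value2"},
)
print(req.get_method())  # POST
print(req.data)
```

Splitting the request building from the sending is just a convenience here: it lets you check the method and body before hitting the real site.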

I suggest you do yourself a favor and use mechanize, a full-blown urllib2 replacement that behaves much like a real browser (although it does not run JavaScript). A lot of sites use hidden fields and cookies, neither of which urllib2 handles for you by default, whereas mechanize does. Redirects are the one thing urllib2 does follow on its own.

Check out "Emulating a browser in Python with mechanize" for a good example.
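If mechanize is not an option, the cookie part at least can be covered with the standard library. This is a rough Python 3 sketch, assuming http.cookiejar and urllib.request (the Python 3 homes of cookielib and urllib2); make_session_opener is a made-up name, and hidden form fields would still have to be extracted by hand.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Return (opener, jar); the opener keeps cookies between requests (hypothetical helper)."""
    jar = http.cookiejar.CookieJar()
    # build_opener also installs the default handlers, including the redirect handler.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_session_opener()
# Typical use (not executed here): open the form page first so the site can set
# its cookies, then POST with the same opener so they are sent back.
#   opener.open(form_url)
#   opener.open(post_url, encoded_form_data)
print(len(jar))  # no cookies stored until a request has been made
```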



Answer 2:

Using urllib and urllib2 together:

import urllib
import urllib2
data = urllib.urlencode([('field1', val1), ('field2', val2)])  # list of two-element tuples
content = urllib2.urlopen('post-url', data).read()

content will then hold the page source.
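One reason to prefer the list of tuples over a dict is that it keeps the field order and allows the same field name more than once. A small Python 3 sketch of what the encoder produces, with invented field names (serial, country, opt) purely for illustration:

```python
import urllib.parse

# A list of (name, value) tuples is encoded in the order given.
pairs = [("serial", "AB 12&3"), ("country", "US")]
encoded = urllib.parse.urlencode(pairs)
print(encoded)  # serial=AB+12%263&country=US

# The same name may appear more than once, which a plain dict cannot express.
repeated = urllib.parse.urlencode([("opt", "a"), ("opt", "b")])
print(repeated)  # opt=a&opt=b
```

Note that spaces and ampersands in the values are escaped for you, so there is no need to quote them beforehand.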



Answer 3:

I’ve only done a little bit of this, but:

  1. You’ve got the HTML of the form page. Extract the name attribute for each form field you need to fill in.
  2. Create a dictionary mapping the name of each form field to the value you want to submit.
  3. Use urllib.urlencode to turn the dictionary into the body of your POST request.
  4. Include this encoded data as the second argument to urllib2.Request(), after the URL that the form should be submitted to, and pass the resulting request to urllib2.urlopen().
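Steps 1 to 3 could be sketched roughly as follows in Python 3, using html.parser from the standard library. The FormFieldParser class, the sample form and its field names (token, serial) are all invented for illustration; a real form may also use select or textarea elements, which this sketch ignores.

```python
from html.parser import HTMLParser
import urllib.parse


class FormFieldParser(HTMLParser):
    """Collect name/value pairs of <input> elements, hidden ones included."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:  # inputs without a name are never submitted
            self.fields[name] = attributes.get("value") or ""


# Made-up form standing in for the page you already downloaded.
SAMPLE_FORM = """
<form action="/lookup" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="serial">
  <input type="submit" value="Go">
</form>
"""

parser = FormFieldParser()
parser.feed(SAMPLE_FORM)       # step 1: extract the field names
fields = parser.fields
fields["serial"] = "XYZ789"    # step 2: fill in the value you want to submit
body = urllib.parse.urlencode(fields)  # step 3: encode the request body
print(body)  # token=abc123&serial=XYZ789
```

Starting from the values already in the form means hidden fields such as the token get submitted unchanged, which many sites expect.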

The server will either return the resulting web page directly or return a redirect to it. urllib2 follows redirects by default, so in most cases you can just read the final page from the response; you only need to issue the GET request to the redirect URL yourself if that handling has been disabled.

I hope that makes some sort of sense?