Python unable to retrieve form with urllib or mech

Posted 2019-01-26 13:53

I'm trying to fill out and submit a form using Python, but I'm not able to retrieve the resulting page. I've tried both mechanize and urllib/urllib2 methods to post the form, but both run into problems.

The form I'm trying to retrieve is here: http://zrs.leidenuniv.nl/ul/start.php. The page is in Dutch, but this is irrelevant to my problem. It may be noteworthy that the form action redirects to http://zrs.leidenuniv.nl/ul/query.php.

First of all, this is the urllib/urllib2 method I've tried:

import urllib, urllib2
import socket, cookielib

url = 'http://zrs.leidenuniv.nl/ul/start.php'
params = {'day': 1, 'month': 5, 'year': 2012, 'quickselect' : "unchecked",
          'res_instantie': '_ALL_', 'selgebouw': '_ALL_', 'zrssort': "locatie",
          'submit' : "Uitvoeren"}
http_header = {  "User-Agent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11",
                 "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                 "Accept-Language" : "nl-NL,nl;q=0.8,en-US;q=0.6,en;q=0.4" }

timeout = 15
socket.setdefaulttimeout(timeout)

request = urllib2.Request(url, urllib.urlencode(params), http_header)
response = urllib2.urlopen(request)

cookies = cookielib.CookieJar()
cookies.extract_cookies(response, request)
cookie_handler = urllib2.HTTPCookieProcessor(cookies)
redirect_handler = urllib2.HTTPRedirectHandler()

opener = urllib2.build_opener(redirect_handler, cookie_handler)

response = opener.open(request)
html = response.read()

However, when I print the retrieved HTML I get the original page, not the one the form action refers to. Any hints as to why this doesn't submit the form would be greatly appreciated.

Because the above didn't work, I also tried to use mechanize to submit the form. However, this results in a ParseError with the following code:

import mechanize

url = 'http://zrs.leidenuniv.nl/ul/start.php'
br = mechanize.Browser()
response = br.open(url)
br.select_form(nr = 0)

where the last line exits with the following: "ParseError: unexpected '-' char in declaration". Now I realize that this error may indicate an error in the DOCTYPE declaration, but since I can't edit the form page I'm not able to try different declarations. Any help on this error is also greatly appreciated.

Thanks in advance for your help.

1 answer
聊天终结者
#2 · 2019-01-26 14:34

It's because the DOCTYPE part is malformed.

Also it contains some strange tags like:

<!Co Dreef / Eelco de Graaff Faculteit der Rechtsgeleerdheid Universiteit Leiden><!e-mail j.dreef@law.leidenuniv.nl >

Try validating the page yourself...


Nonetheless, you can just strip off the junk to make mechanize's HTML parser happy:

import mechanize

url = 'http://zrs.leidenuniv.nl/ul/start.php'

br = mechanize.Browser()
response = br.open(url)
# Skip the first 177 characters (the junk at the top of the page) before parsing
response.set_data(response.get_data()[177:])
br.set_response(response)

br.select_form(nr = 0)
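
The hard-coded offset of 177 will stop working if the page header ever changes length. A slightly more robust variant is to cut at the first `<html` tag instead. This is only a sketch: `strip_before_html` is a hypothetical helper name, and it assumes that all of the malformed declarations sit before the `<html` tag and that `get_data()` returns a byte string.

```python
import re

# Matches the opening <html tag, ignoring case.
HTML_START = re.compile(br'<html\b', re.IGNORECASE)

def strip_before_html(data):
    """Return data starting at the first <html tag.

    If no <html tag is found, the data is returned unchanged.
    """
    match = HTML_START.search(data)
    return data[match.start():] if match else data
```

With this helper, the slicing line above would become `response.set_data(strip_before_html(response.get_data()))`. If some of the strange declarations turn out to appear after the `<html` tag, they would need to be removed separately.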