With Python 2.7 I'm scraping with urllib2 and, when some XPath is needed, lxml as well. It's fast, and because I rarely have to navigate around the sites, this combination works well. On occasion, though, usually when I reach a page that will only display some valuable data after a short form is filled in and a submit button is clicked (example), the scraping-only approach with urllib2 is not sufficient.
Each time such a page is encountered, I could invoke selenium.webdriver to refetch the page and do the form-filling and clicking, but that would slow things down considerably.
NOTE: This question is not about the merits or limitations of urllib2, about which I am aware there have been many discussions. It's instead focused only on finding a fast, headless approach to form-filling etc. (one that will also allow for XPath queries if needed).
In addition to the ones alecxe mentioned, another alternative is to use a GUI browser tool such as Firefox's Web Console to inspect the POST that is made when you click the submit button. Sometimes you can find the POST data and simply spoof it. For example, using the example url you posted, if you open the Web Console, click the submit button and inspect the resulting request, you will obtain the raw POST data (the field names and the values the form submits).
(Note that the Web Console menus vary a bit depending on your version of Firefox, so YMMV.) Then you can spoof the POST using code such as:
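A minimal sketch of such a spoofed POST with urllib2 and lxml might look like the following; the URL and the specific values are placeholders rather than data taken from the example page, and the field names are the ones discussed further down:

```python
# A sketch only: the URL and the numeric values below are assumptions; use
# whatever the Web Console actually shows for your form.
import urllib
import urllib2
import lxml.html

url = 'http://example.com/search'   # hypothetical form-action URL
post_data = {
    'field_36[]': '73',             # Contract Type: 73 = "Permanent", 74 = "Temporary"
    'field_37[]': '76',             # assumed value; copy it from the Web Console
    'field_32[]': '2',              # assumed value
    'all': 'engineer',              # free-text search term
}

# For multi-valued fields (e.g. several checked boxes), pass a list and use
# urllib.urlencode(post_data, doseq=True) instead.
request = urllib2.Request(url, urllib.urlencode(post_data))
response = urllib2.urlopen(request)
html = response.read()

# The result can be parsed with lxml and queried with XPath as usual.
root = lxml.html.fromstring(html)
print root.xpath('//title/text()')
```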
which yields the HTML of the results page.
If you inspect the HTML and search for field_36[], you'll find the markup for that form control, from which it is easy to surmise that field_36[] controls the Contract Type, and that value 73 corresponds to "Permanent", 74 corresponds to "Temporary", etc. Similarly, you can figure out the options for field_37[], field_32[] and all (which can be any search term string). If you have a good understanding of HTML, you may not even need the browser tool to construct the POST.

There are several things you can consider using:
mechanize
robobrowser
selenium with a headless browser, like PhantomJS, for example, or using a regular browser, but in a Virtual Display (see the sketch after this list)
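As a rough sketch of that last option (the URL and the form-field names here are hypothetical, not taken from the example page):

```python
# A sketch only: URL and form-field names are hypothetical.
import lxml.html
from selenium import webdriver

driver = webdriver.PhantomJS()      # headless: no visible browser window
try:
    driver.get('http://example.com/search')

    # Fill in the short form and submit it.
    driver.find_element_by_name('all').send_keys('engineer')
    driver.find_element_by_css_selector('input[type=submit]').click()

    # The rendered page can then be handed to lxml for XPath queries.
    root = lxml.html.fromstring(driver.page_source)
    print root.xpath('//title/text()')
finally:
    driver.quit()

# For the virtual-display variant, the usual pattern is to start
# pyvirtualdisplay's Display(visible=0) before creating webdriver.Firefox().
```

This is slower than spoofing the POST directly, but it copes with pages that need JavaScript to render the data.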