Alternatives to Selenium/WebDriver for filling in forms


Question:

With Python 2.7 I'm scraping with urllib2 and, when XPath is needed, lxml as well. It's fast, and because I rarely have to navigate around the sites, this combination works well. On occasion, though, usually when I reach a page that will only display some valuable data after a short form is filled in and a submit button is clicked (example: http://apply.ovoenergycareers.co.uk/vacancies/#results), the scraping-only approach with urllib2 is not sufficient.

Each time such a page is encountered, I could invoke selenium.webdriver to refetch the page and do the form-filling and clicking, but this would slow things down considerably.

NOTE: This question is not about the merits or limitations of urllib2, about which I am aware there have been many discussions. It is focused only on finding a fast, headless approach to form-filling and the like (one that also allows for XPath queries if needed).

Answer 1:

There are several things you can consider using:

  • mechanize (see the sketch after this list)
  • robobrowser
  • selenium with a headless browser such as PhantomJS, or with a regular browser running in a virtual display
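
For the first option, here is a minimal mechanize sketch of filling in and submitting such a form. It is untested against the site; the form index and the field name/values are assumptions borrowed from the page dissected in the next answer, so adjust them to the actual form:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # some sites' robots.txt would otherwise block mechanize
br.open("http://apply.ovoenergycareers.co.uk/vacancies/#results")

# Assumption: the search form is the first form on the page
br.select_form(nr=0)
# Multi-select controls take a list of option values (as strings)
br["field_36[]"] = ["73"]

response = br.submit()
html = response.read()

The resulting html string can then be fed to lxml for the usual XPath queries, so the rest of the scraping pipeline stays unchanged.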


Answer 2:

In addition to the ones alecxe mentioned, another alternative is to use a GUI browser tool such as Firefox's Web Console to inspect the POST that is made when you click the submit button. Sometimes you can find the POST data and simply spoof it. For example, using the example URL you posted, if you

  • Use Firefox to go to http://apply.ovoenergycareers.co.uk/vacancies/#results
  • Click Tools > Web Developer > Web Console
  • Click Net > Log Request and Response Bodies
  • Fill in the form, click Search
  • Left-click the (first) POST in the Web Console
  • Right-click the (first) POST, select Copy POST Data
  • Paste the POST data in a text editor

you will obtain something like

all=
field_36[]=73
field_37[]=76
field_32[]=82
submit=Search

(Note that the Web Console menus vary a bit depending on your version of Firefox, so YMMV.) Then you can spoof the POST using code such as:

import urllib2
import urllib
import lxml.html as LH

url = "http://apply.ovoenergycareers.co.uk/vacancies/#results"
# Encode the form fields captured from the Web Console; passing data to
# urlopen turns the request into a POST instead of a GET
params = urllib.urlencode([('field_36[]', 73), ('field_37[]', 76), ('field_32[]', 82)])
response = urllib2.urlopen(url, params)
content = response.read()
# Parse the response and print the text of each <dl> (the search results)
root = LH.fromstring(content)
print('\n'.join([tag.text_content() for tag in root.xpath('//dl')]))

which yields

  Regulatory Data Analyst
          Contract Type
            Permanent
                    Contract Hours
            Full-time
                    Location
            Bristol
                    Department
            Business Intelligence
                    Full description

If you inspect the HTML and search for field_36[] you'll find

<div class="multiwrapper">
<p class="sidenote multi">(Hold the ctrl (pc) or cmd (Mac) keys for multi-selects) </p>
<select class="select-long" multiple size="5" name="field_36[]" id="field_36"><option value="0">- select all -</option>
<option selected value="73" title="Permanent">Permanent</option>
<option value="74" title="Temporary">Temporary</option>
<option value="75" title="Fixed-term">Fixed-term</option>
<option value="81" title="Intern">Intern</option></select>
</div>

from which it is easy to surmise that field_36[] controls the Contract Type, and that value 73 corresponds to "Permanent", 74 to "Temporary", and so on. Similarly, you can figure out the options for field_37[], field_32[] and all (which can be any search-term string). If you have a good understanding of HTML, you may not even need the browser tool to construct the POST.
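
To avoid hand-mapping every option, here is a short sketch, using the same urllib2/lxml combination as above, that scrapes the value-to-label mapping for every field_* select automatically (the XPath expressions are assumptions based on the markup shown above):

import urllib2
import lxml.html as LH

url = "http://apply.ovoenergycareers.co.uk/vacancies/#results"
root = LH.fromstring(urllib2.urlopen(url).read())

# Build {'field_36[]': {'73': 'Permanent', '74': 'Temporary', ...}, ...},
# skipping the "- select all -" placeholder (value "0")
fields = {}
for select in root.xpath('//select[starts-with(@name, "field_")]'):
    options = select.xpath('./option[@value != "0"]')
    fields[select.get('name')] = dict(
        (opt.get('value'), opt.text_content().strip()) for opt in options)

for name, values in sorted(fields.items()):
    print('%s: %r' % (name, values))

With that mapping in hand, the urlencode call above can be assembled from human-readable labels instead of hard-coded numbers.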