I need to scrape query results from an .aspx web page.
http://legistar.council.nyc.gov/Legislation.aspx
The url is static, so how do I submit a query to this page and get the results? Assume we need to select "all years" and "all types" from the respective dropdown menus.
Somebody out there must know how to do this.
Most ASP.NET sites (the one you referenced included) will actually post their queries back to themselves using the HTTP POST verb, not the GET verb. That is why the URL is not changing as you noted.
What you will need to do is look at the generated HTML and capture all their form values. Be sure to capture all the form values, as some of them are used to page validation and without them your POST request will be denied.
Other than the validation, an ASPX page in regards to scraping and posting is no different than other web technologies.
"Assume we need to select "all years" and "all types" from the respective dropdown menus."
What do these options do to the URL that is ultimately submitted.
After all, it amounts to an HTTP request sent via
urllib2
.Do know how to do '"all years" and "all types" from the respective dropdown menus' you do the following.
Select '"all years" and "all types" from the respective dropdown menus'
Note the URL which is actually submitted.
Use this URL in
urllib2
.Selenium is a great tool to use for this kind of task. You can specify the form values that you want to enter and retrieve the html of the response page as a string in a couple of lines of python code. Using Selenium you might not have to do the manual work of simulating a valid post request and all of its hidden variables, as I found out after much trial and error.
The code in the other answers was useful; I never would have been able to write my crawler without it.
One problem I did come across was cookies. The site I was crawling was using cookies to log session id/security stuff, so I had to add code to get my crawler to work:
Add this import:
Init the cookie stuff:
Install
CookieJar
so that it is used as the defaultCookieProcessor
in the default opener handler:To see what cookies the site is using:
This saves the cookies:
As an overview, you will need to perform four main tasks:
The http request and response handling is done with methods and classes from Python's standard library's urllib and urllib2. The parsing of the html pages can be done with Python's standard library's HTMLParser or with other modules such as Beautiful Soup
The following snippet demonstrates the requesting and receiving of a search at the site indicated in the question. This site is ASP-driven and as a result we need to ensure that we send several form fields, some of them with 'horrible' values as these are used by the ASP logic to maintain state and to authenticate the request to some extent. Indeed submitting. The requests have to be sent with the http POST method as this is what is expected from this ASP application. The main difficulty is with identifying the form field and associated values which ASP expects (getting pages with Python is the easy part).
This code is functional, or more precisely, was functional, until I removed most of the VSTATE value, and possibly introduced a typo or two by adding comments.
That's about it for the getting of the initial page. As said above, then one would need to parse the page, i.e. find the parts of interest and gather them as appropriate, and store them to file/database/whereever. This job can be done in very many ways: using html parsers, or XSLT type of technogies (indeed after parsing the html to xml), or even for crude jobs, simple regular-expression. Also, one of the items one typically extracts is the "next info", i.e. a link of sorts, that can be used in a new request to the server to get subsequent pages.
This should give you a rough flavor of what "long hand" html scraping is about. There are many other approaches to this, such as dedicated utilties, scripts in Mozilla's (FireFox) GreaseMonkey plug-in, XSLT...