I want scrapy to crawl pages where going on to the next link looks like this:
<a href="#" onclick="return gotoPage('2');"> Next </a>
Will Scrapy be able to interpret the JavaScript in that link?
With the LiveHTTPHeaders extension I found out that clicking Next generates a POST with a really huge piece of "garbage" starting like this:
encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n
I am trying to build my spider on the CrawlSpider class, but I can't really figure out how to code it. With BaseSpider I used the parse() method to process the first URL, which happens to be a login form, where I did a POST with:
from scrapy.http import FormRequest

def logon(self, response):
    login_form_data = {'email': 'user@example.com', 'password': 'mypass22', 'action': 'sign-in'}
    return [FormRequest.from_response(response, formnumber=0,
                                      formdata=login_form_data,
                                      callback=self.submit_next)]
And then I defined submit_next() to tell it what to do next. What I can't figure out is how to tell CrawlSpider which method to use on the first URL.
All requests in my crawling, except the first one, are POST requests. They alternate between two types of requests: posting some data, and clicking "Next" to go to the next page.
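For reference, a minimal sketch of one way to handle the first URL with CrawlSpider: it calls parse_start_url() with the response for every URL in start_urls, so the login POST can be issued from there. The class name, start URL, and credentials below are placeholders:

from scrapy.spiders import CrawlSpider   # scrapy.contrib.spiders in old Scrapy versions
from scrapy.http import FormRequest

class MySpider(CrawlSpider):  # placeholder name
    name = 'myspider'
    start_urls = ['http://example.com/login']  # placeholder URL

    def parse_start_url(self, response):
        # CrawlSpider reserves parse() for its own rule machinery, but it
        # calls parse_start_url() with the response for each start URL,
        # so the login POST from the question can go here.
        login_form_data = {'email': 'user@example.com',
                           'password': 'mypass22',
                           'action': 'sign-in'}
        return [FormRequest.from_response(response,
                                          formnumber=0,
                                          formdata=login_form_data,
                                          callback=self.submit_next)]

    def submit_next(self, response):
        # continue the chain of POST requests from here
        pass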
I built a quick crawler that executes JS via Selenium. Feel free to copy or modify it: https://github.com/rickysahu/seleniumjscrawl
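If you just want the gist of that approach, here is a minimal sketch of paginating through JavaScript "Next" links with Selenium; the URL is a placeholder, and the exact driver setup depends on your Selenium install:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()  # any WebDriver works
driver.get('http://example.com/results')  # placeholder URL

while True:
    html = driver.page_source  # the DOM as it stands after the JavaScript has run
    # ... extract items from html here ...
    try:
        # Clicking the link runs gotoPage(...) in a real browser for us;
        # in practice you may need an explicit wait after the click.
        driver.find_element(By.LINK_TEXT, 'Next').click()
    except NoSuchElementException:
        break  # no Next link left, pagination is done

driver.quit()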
The actual methodology will be as follows: rather than having Scrapy execute the JavaScript, replicate the POST requests that clicking "Next" generates. All of this has to be streamlined with the server's response mechanism, e.g. by passing dont_click=True to FormRequest.from_response.
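As an illustration (not the answerer's exact code), a minimal sketch of such a paging request; the formdata field and the parse_page callback are hypothetical, while the encoded_session_hidden_map value is picked up automatically from the form:

from scrapy.http import FormRequest

def submit_next(self, response):
    # from_response() copies every field of the form, including hidden ones
    # such as encoded_session_hidden_map, so they don't need re-encoding.
    # dont_click=True stops Scrapy from simulating a submit-button click,
    # because the real page "submits" via the gotoPage() JavaScript instead.
    return FormRequest.from_response(
        response,
        formnumber=0,                # assumption: the paging form is the first form
        formdata={'page': '2'},      # hypothetical field that gotoPage('2') would set
        dont_click=True,
        callback=self.parse_page,    # placeholder callback for the next page
    )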
Now, how to figure it all out: use a web debugger like Fiddler, the Firefox plugin Firebug, or simply hit F12 in IE 9, and check that the requests a user actually makes on the website match the ones your spider is sending.