I want scrapy to crawl pages where going on to the next link looks like this:
<a href="#" onclick="return gotoPage('2');"> Next </a>
Will Scrapy be able to interpret the JavaScript in that link?
With the LiveHTTPHeaders extension I found out that clicking Next generates a POST with a really huge piece of "garbage" starting like this:
encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n
I am trying to build my spider on the CrawlSpider class, but I can't really figure out how to code it. With BaseSpider I used the parse() method to process the first URL, which happens to be a login form, where I did a POST with:
from scrapy.http import FormRequest

def logon(self, response):
    login_form_data = {'email': 'user@example.com', 'password': 'mypass22', 'action': 'sign-in'}
    return [FormRequest.from_response(response, formnumber=0,
                                      formdata=login_form_data,
                                      callback=self.submit_next)]
And then I defined submit_next() to tell it what to do next. What I can't figure out is how to tell CrawlSpider which method to use on the first URL.
All requests in my crawling, except the first one, are POST requests. They alternate between two types of requests: posting some data, and clicking "Next" to go to the next page.
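For reference, a minimal sketch of one way to handle the first URL with CrawlSpider: it calls parse_start_url() with the response for every URL in start_urls, so the login POST can be issued from there. The class name, start URL, and credentials below are placeholders:

from scrapy.spiders import CrawlSpider   # scrapy.contrib.spiders in old Scrapy versions
from scrapy.http import FormRequest

class MySpider(CrawlSpider):  # placeholder name
    name = 'myspider'
    start_urls = ['http://example.com/login']  # placeholder URL

    def parse_start_url(self, response):
        # CrawlSpider reserves parse() for its own rule machinery, but it
        # calls parse_start_url() with the response for each start URL,
        # so the login POST from the question can go here.
        login_form_data = {'email': 'user@example.com',
                           'password': 'mypass22',
                           'action': 'sign-in'}
        return [FormRequest.from_response(response,
                                          formnumber=0,
                                          formdata=login_form_data,
                                          callback=self.submit_next)]

    def submit_next(self, response):
        # continue the chain of POST requests from here
        pass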
I built a quick crawler that executes JS via Selenium. Feel free to copy or modify it: https://github.com/rickysahu/seleniumjscrawl
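If you just want the gist of that approach, here is a minimal sketch of paginating through JavaScript "Next" links with Selenium; the URL is a placeholder, and the exact driver setup depends on your Selenium install:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()  # any WebDriver works
driver.get('http://example.com/results')  # placeholder URL

while True:
    html = driver.page_source  # the DOM as it stands after the JavaScript has run
    # ... extract items from html here ...
    try:
        # Clicking the link runs gotoPage(...) in a real browser for us;
        # in practice you may need an explicit wait after the click.
        driver.find_element(By.LINK_TEXT, 'Next').click()
    except NoSuchElementException:
        break  # no Next link left, pagination is done

driver.quit()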
The actual methodology will be as follows: rather than having Scrapy execute the JavaScript, replicate the POST requests that clicking "Next" generates. All of this has to be streamlined with the server's response mechanism, e.g. by passing dont_click=True to FormRequest.from_response.
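As an illustration (not the answerer's exact code), a minimal sketch of such a paging request; the formdata field and the parse_page callback are hypothetical, while the encoded_session_hidden_map value is picked up automatically from the form:

from scrapy.http import FormRequest

def submit_next(self, response):
    # from_response() copies every field of the form, including hidden ones
    # such as encoded_session_hidden_map, so they don't need re-encoding.
    # dont_click=True stops Scrapy from simulating a submit-button click,
    # because the real page "submits" via the gotoPage() JavaScript instead.
    return FormRequest.from_response(
        response,
        formnumber=0,                # assumption: the paging form is the first form
        formdata={'page': '2'},      # hypothetical field that gotoPage('2') would set
        dont_click=True,
        callback=self.parse_page,    # placeholder callback for the next page
    )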
Now, how to figure it all out: use a web debugger like Fiddler, the Firefox plugin Firebug, or simply hit F12 in IE 9, and check that the requests a user actually makes on the website match the ones your spider is sending.