Scrapy can't crawl all links on a page

Posted 2019-03-02 13:43

I am trying to use Scrapy to crawl an AJAX website: http://play.google.com/store/apps/category/GAME/collection/topselling_new_free

I want to get all the links directing to each game.

I inspected the elements of the page (screenshot omitted), and I want to extract all links matching the pattern /store/apps/details?id=

But when I run the selector in the Scrapy shell, it returns nothing (screenshot of the shell command omitted).

I've also tried //a/@href, which didn't work either. I don't know what's going wrong.
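For reference, this is the kind of extraction I'd expect to work once the link markup is actually present in the response body (a minimal stdlib sketch against a hypothetical HTML snippet, since the real page is populated by JavaScript and the snippet below is not the actual page source):

```python
import re

# Hypothetical fragment of the rendered page; the real page fills this
# in via AJAX, which is why the Scrapy shell sees nothing.
html = '''
<a href="/store/apps/details?id=com.example.game1">Game 1</a>
<a href="/store/apps/details?id=com.example.game2">Game 2</a>
<a href="/store/apps/category/GAME">Games</a>
'''

# Keep only hrefs that point at an app detail page.
links = re.findall(r'href="(/store/apps/details\?id=[^"]+)"', html)
print(links)
# -> ['/store/apps/details?id=com.example.game1',
#     '/store/apps/details?id=com.example.game2']
```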

  • Update: after modifying the start URL and adding 'formdata' as someone suggested, I can now crawl the first 120 links, but no more links after that.

Can someone help me with this?

1 Answer

放我归山 · 2019-03-02 14:14

The data on that page is actually populated by an AJAX POST request, so you won't see it in the Scrapy shell. Instead of inspecting elements, check the Network tab in your browser's developer tools; there you will find the request.

Make a POST request to the URL https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0 with formdata={'start':'0','num':'60','numChildren':'0','ipf':'1','xhr':'1'}.

Increment start by 60 on each request to get the paginated results.
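The pagination described above can be sketched as follows. The form-field names and the step of 60 come from the answer; the helper name `page_formdata` and the commented Scrapy usage are my own illustration, not verified against the live site:

```python
def page_formdata(page):
    """Form data for one AJAX POST request; 'start' advances by 60 per page."""
    return {
        'start': str(page * 60),   # 0, 60, 120, ...
        'num': '60',
        'numChildren': '0',
        'ipf': '1',
        'xhr': '1',
    }

# In a Scrapy spider you would issue one FormRequest per page, e.g.
# (sketch only, assuming the endpoint from the answer still behaves this way):
#
#   yield scrapy.FormRequest(
#       'https://play.google.com/store/apps/category/GAME/collection/'
#       'topselling_new_free?authuser=0',
#       formdata=page_formdata(page),
#       callback=self.parse,
#   )

# The first three requests would carry these 'start' values:
starts = [page_formdata(p)['start'] for p in range(3)]
print(starts)
# -> ['0', '60', '120']
```

This also explains the questioner's symptom: two pages of 60 results each give exactly the 120 links they observed, so the fix is to keep issuing requests with larger `start` values until a response comes back empty.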
