I'm trying to understand why my LinkExtractor doesn't work and when it actually runs in the crawl loop.

This is the page I'm crawling.

- There are 25 listings on each page and their links are parsed in parse_page.
- Then each crawled link is parsed in parse_items.

This script crawls the first page and the items in it without any problem. The problem is that it doesn't follow https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2 (sayfa means page in Turkish) or the other next pages.

I think my Rule and LinkExtractor are correct, because when I tried to allow all links, it didn't work either.
My questions are:

- When are the LinkExtractors supposed to run in this script, and why are they not running?
- How can I make the spider follow the next pages, parse the pages, and parse the items in them with LinkExtractors?
- How can I implement parse_page with the LinkExtractor?

These are the relevant parts of my spider.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisSpider(CrawlSpider):
    name = 'yenibirisspider'

    # Intended to follow pagination links like ...&sayfa=2 and hand each page to parse_page.
    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)),
             callback='parse_page',
             follow=True),
    )

    def __init__(self):
        super().__init__()
        self.allowed_domains = ['yenibiris.com']
        self.start_urls = [
            'https://www.yenibiris.com/is-ilanlari?q=yazilim',
        ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                method='GET',
                callback=self.parse_page
            )

    def parse_page(self, response):
        # Extract the 25 listing links on the page and request each one.
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):
        # crawling the item without any problem here
        yield item
I hate to answer my own question, but I think I figured it out. When I define the start_requests method, I seem to be overriding the rules behavior, so they never ran. When I remove the __init__ and start_requests methods, the spider works as intended.
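For reference, a minimal sketch of what the spider looks like after that change, reusing the selectors from the snippet above; this is not my exact final code, and response.follow is used here only so that relative hrefs also resolve correctly:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisSpider(CrawlSpider):
    name = 'yenibirisspider'
    allowed_domains = ['yenibiris.com']
    start_urls = ['https://www.yenibiris.com/is-ilanlari?q=yazilim']

    # With no start_requests override, CrawlSpider's own parse callback handles
    # each response, applies this rule, follows the ...&sayfa=<n> pagination
    # links, and passes each followed page to parse_page.
    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)),
             callback='parse_page',
             follow=True),
    )

    def parse_page(self, response):
        # Same listing-link extraction as in the question.
        for href in response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_items)

    def parse_items(self, response):
        # item parsing elided here, as in the question
        ...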
It seems like your rule and LinkExtractor are correctly defined. However, I don't understand why you define both start_requests() and start_urls. If you don't override start_requests() and only set start_urls, the parent class's start_requests() generates requests for the URLs in the start_urls attribute. So one of them is redundant in your case. Also, the __init__ definition is wrong. It should be like this:
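(A sketch of the presumably intended signature, assuming the common Scrapy idiom of accepting and forwarding whatever arguments Scrapy passes when it creates the spider:)

def __init__(self, *args, **kwargs):
    # Forward the spider name and any -a command-line arguments to the base
    # class instead of swallowing them with a zero-argument signature.
    super().__init__(*args, **kwargs)
    self.allowed_domains = ['yenibiris.com']
    self.start_urls = ['https://www.yenibiris.com/is-ilanlari?q=yazilim']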
A LinkExtractor extracts links from a response only when that response is handled by CrawlSpider's own built-in callback; because your start_requests sends every response to parse_page instead, the rules never get a chance to run. The regex .*&sayfa=\d+ in the LinkExtractor is appropriate for the webpage, so it should work as expected once you fix the mistakes in your code.
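As a quick sanity check of that claim, the pattern can be tested against the pagination URL from the question outside of Scrapy; LinkExtractor matches its allow= patterns with re.search, so a match anywhere in the URL is enough:

import re

# Allow pattern from the Rule and the pagination URL quoted in the question.
pattern = r'.*&sayfa=\d+'
url = 'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2'

print(bool(re.search(pattern, url)))  # True, so the pattern itself is fine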