Scrapy - LinkExtractor in control flow and why it doesn't run

Posted 2019-08-19 10:32

Question:

I'm trying to understand why my LinkExtractor doesn't work, and when it actually runs in the crawl loop.

This is the page I'm crawling.

  • There are 25 listings on each page, and their links are parsed in parse_page
  • Each crawled link is then parsed in parse_items

This script crawls the first page and the items on it without any problem. The problem is that it doesn't follow https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2 (sayfa means "page" in Turkish) or any of the next pages.

I think my Rule and LinkExtractor are correct, because when I tried allowing all links it didn't work either.

My questions are:

  • When are the LinkExtractors supposed to run in this script, and why are they not running?
  • How can I make the spider follow the next pages, parse them, and parse the items in them with LinkExtractors?
  • How can I implement parse_page with the LinkExtractor?

These are the relevant parts of my spider.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisSpider(CrawlSpider):

    name = 'yenibirisspider'

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)),
             callback='parse_page',
             follow=True),
    )


    def __init__(self):
        super().__init__()
        self.allowed_domains = ['yenibiris.com']

        self.start_urls = [
            'https://www.yenibiris.com/is-ilanlari?q=yazilim',
        ]


    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                method='GET',
                callback=self.parse_page
            )

    def parse_page(self, response):
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):

        # crawling the item without any problem here

        yield item

Answer 1:

I hate to answer my own question, but I think I figured it out. When I defined a start_requests function with my own callback, I was overriding the rules behavior: a CrawlSpider only applies its rules to responses that go through its built-in parse callback, and my responses went straight to parse_page instead, so the LinkExtractor never ran. When I remove the __init__ and start_requests functions, the spider works as intended.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisSpider(CrawlSpider):

    name = 'yenibirisspider'

    start_urls = [
        'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=1',
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)), callback='parse_page', follow=True),
    )


    def parse_page(self, response):
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):

        # crawling the item without any problem here

        yield item
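
With the rule-driven version above in place, the spider can be run in the usual way. A hypothetical example invocation, assuming the spider lives inside a Scrapy project (the output filename is just an illustration):

scrapy crawl yenibirisspider -o yenibiris_items.json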


Answer 2:

It seems like your Rule and LinkExtractor are correctly defined. However, I don't understand why you define both start_requests() and start_urls. If you don't override start_requests() and only set start_urls, the parent class's start_requests() generates requests for the URLs in the start_urls attribute, so one of them is redundant in your case. Also, your __init__ definition is wrong. It should be like this:

def __init__(self,*args,**kwargs):
    super(YenibirisSpider,self).__init__(*args,**kwargs)
    ...
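
For reference, when start_requests() is not overridden, the inherited version does roughly the following (a simplified sketch, not the exact Scrapy source). The callback defaults to the spider's parse method, which in a CrawlSpider is exactly the method that applies the rules:

def start_requests(self):
    for url in self.start_urls:
        # No explicit callback: the response goes to self.parse,
        # which in a CrawlSpider is what runs the rules/LinkExtractors.
        yield scrapy.Request(url, dont_filter=True)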

When are the LinkExtractors supposed to run in this script, and why are they not running?

A Rule's LinkExtractor runs when a response is received and processed by CrawlSpider's built-in parse callback; it extracts the matching links from that response, and the spider then schedules requests for them.
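
To make that concrete, here is a minimal, self-contained sketch (the HTML snippet is made up) that reproduces the extraction step manually:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A fabricated response containing one pagination link like the ones on the listing page.
response = HtmlResponse(
    url='https://www.yenibiris.com/is-ilanlari?q=yazilim',
    body=b'<a href="/is-ilanlari?q=yazilim&sayfa=2">2</a>',
    encoding='utf-8',
)

# The same extractor as in the spider's Rule; hrefs are resolved to absolute URLs.
links = LinkExtractor(allow=(r'.*&sayfa=\d+',)).extract_links(response)
print([link.url for link in links])
# ['https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2']

CrawlSpider does the same thing for each Rule whenever a response reaches its built-in parse callback, and then schedules a request for every extracted link.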

How can I make the spider follow the next pages, parse them, and parse the items in them with LinkExtractors?

The regex .*&sayfa=\d+ in the LinkExtractor is appropriate for the webpage, and it should work as expected once you fix the mistakes in your code.
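
A quick way to confirm this is to apply the pattern to one of the pagination URLs yourself; LinkExtractor matches its allow patterns against the absolute URL with re.search:

import re

url = 'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2'
print(bool(re.search(r'.*&sayfa=\d+', url)))  # True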

How can I implement parse_page with the LinkExtractor?

I don't understand what you mean here.