Scrapy - LinkExtractor in control flow and why it doesn't run

Posted 2019-08-19 10:32

Question:

I'm trying to understand why my LinkExtractor doesn't work, and when it actually runs in the crawl loop.

This is the page I'm crawling.

  • There are 25 listings on each page, and their links are parsed in parse_page
  • Each crawled link is then parsed in parse_items

This script crawls the first page and the items on it without any problem. The problem is that it doesn't follow https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2 (sayfa means "page" in Turkish) or any of the next pages.

I think my Rule and LinkExtractor are correct, because when I tried allowing all links it didn't work either.

My questions are:

  • When are the LinkExtractors supposed to run in this script, and why are they not running?
  • How can I make the spider follow the next pages, parse them, and parse the items in them with LinkExtractors?
  • How can I implement parse_page with the LinkExtractor?

These are the relevant parts of my spider.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisSpider(CrawlSpider):

    name = 'yenibirisspider'

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)),
             callback='parse_page',
             follow=True),
    )


    def __init__(self):
        super().__init__()
        self.allowed_domains = ['yenibiris.com']

        self.start_urls = [
            'https://www.yenibiris.com/is-ilanlari?q=yazilim',
        ]


    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                method='GET',
                callback=self.parse_page
            )

    def parse_page(self, response):
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):

        # crawling the item without any problem here

        yield item

Answer 1:

I hate to answer my own question, but I think I figured it out. When I defined a start_requests function with my own callback, I was overriding the rules behavior: a CrawlSpider only applies its rules to responses that go through its built-in parse callback, and my responses went straight to parse_page instead, so the LinkExtractor never ran. When I remove the __init__ and start_requests functions, the spider works as intended.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisSpider(CrawlSpider):

    name = 'yenibirisspider'

    start_urls = [
        'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=1',
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)), callback='parse_page', follow=True),
    )


    def parse_page(self, response):
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):

        # crawling the item without any problem here

        yield item
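
With the rule-driven version above in place, the spider can be run in the usual way. A hypothetical example invocation, assuming the spider lives inside a Scrapy project (the output filename is just an illustration):

scrapy crawl yenibirisspider -o yenibiris_items.json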


Answer 2:

It seems like your Rule and LinkExtractor are correctly defined. However, I don't understand why you define both start_requests() and start_urls. If you don't override start_requests() and only set start_urls, the parent class's start_requests() generates requests for the URLs in the start_urls attribute, so one of them is redundant in your case. Also, your __init__ definition is wrong. It should be like this:

def __init__(self,*args,**kwargs):
    super(YenibirisSpider,self).__init__(*args,**kwargs)
    ...
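
For reference, when start_requests() is not overridden, the inherited version does roughly the following (a simplified sketch, not the exact Scrapy source). The callback defaults to the spider's parse method, which in a CrawlSpider is exactly the method that applies the rules:

def start_requests(self):
    for url in self.start_urls:
        # No explicit callback: the response goes to self.parse,
        # which in a CrawlSpider is what runs the rules/LinkExtractors.
        yield scrapy.Request(url, dont_filter=True)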

When are the LinkExtractors supposed to run in this script, and why are they not running?

A Rule's LinkExtractor runs when a response is received and processed by CrawlSpider's built-in parse callback; it extracts the matching links from that response, and the spider then schedules requests for them.
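
To make that concrete, here is a minimal, self-contained sketch (the HTML snippet is made up) that reproduces the extraction step manually:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A fabricated response containing one pagination link like the ones on the listing page.
response = HtmlResponse(
    url='https://www.yenibiris.com/is-ilanlari?q=yazilim',
    body=b'<a href="/is-ilanlari?q=yazilim&sayfa=2">2</a>',
    encoding='utf-8',
)

# The same extractor as in the spider's Rule; hrefs are resolved to absolute URLs.
links = LinkExtractor(allow=(r'.*&sayfa=\d+',)).extract_links(response)
print([link.url for link in links])
# ['https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2']

CrawlSpider does the same thing for each Rule whenever a response reaches its built-in parse callback, and then schedules a request for every extracted link.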

How can I make the spider follow the next pages, parse them, and parse the items in them with LinkExtractors?

The regex .*&sayfa=\d+ in the LinkExtractor is appropriate for the webpage, and it should work as expected once you fix the mistakes in your code.
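
A quick way to confirm this is to apply the pattern to one of the pagination URLs yourself; LinkExtractor matches its allow patterns against the absolute URL with re.search:

import re

url = 'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2'
print(bool(re.search(r'.*&sayfa=\d+', url)))  # True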

How can I implement parse_page with the LinkExtractor?

I don't understand what you mean here.