I'm trying to understand why my LinkExtractor doesn't work and when it actually runs in the crawl loop.

This is the page I'm crawling.

- There are 25 listings on each page and their links are parsed in parse_page.
- Then each crawled link is parsed in parse_items.

This script crawls the first page and the items in it without any problem. The problem is that it doesn't follow https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2 (sayfa means page in Turkish) or the other next pages.

I think my Rule and LinkExtractor are correct, because when I tried to allow all links, it didn't work either.
My questions are:

- When are the LinkExtractors supposed to run in this script, and why are they not running?
- How can I make the spider follow the next pages, parse the pages, and parse the items in them with LinkExtractors?
- How can I implement parse_page with the LinkExtractor?

These are the relevant parts of my spider.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisSpider(CrawlSpider):
    name = 'yenibirisspider'

    # Intended to follow pagination links like ...&sayfa=2 and hand each page to parse_page.
    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)),
             callback='parse_page',
             follow=True),
    )

    def __init__(self):
        super().__init__()
        self.allowed_domains = ['yenibiris.com']
        self.start_urls = [
            'https://www.yenibiris.com/is-ilanlari?q=yazilim',
        ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                method='GET',
                callback=self.parse_page
            )

    def parse_page(self, response):
        # Extract the 25 listing links on the page and request each one.
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):
        # crawling the item without any problem here
        yield item
I hate to answer my own question, but I think I figured it out. When I define the start_requests method, I seem to be overriding the rules behavior, so they never ran. When I remove the __init__ and start_requests methods, the spider works as intended.
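For reference, a minimal sketch of what the spider looks like after that change, reusing the selectors from the snippet above; this is not my exact final code, and response.follow is used here only so that relative hrefs also resolve correctly:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisSpider(CrawlSpider):
    name = 'yenibirisspider'
    allowed_domains = ['yenibiris.com']
    start_urls = ['https://www.yenibiris.com/is-ilanlari?q=yazilim']

    # With no start_requests override, CrawlSpider's own parse callback handles
    # each response, applies this rule, follows the ...&sayfa=<n> pagination
    # links, and passes each followed page to parse_page.
    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)),
             callback='parse_page',
             follow=True),
    )

    def parse_page(self, response):
        # Same listing-link extraction as in the question.
        for href in response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_items)

    def parse_items(self, response):
        # item parsing elided here, as in the question
        ...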
It seems like your rule and LinkExtractor are correctly defined. However, I don't understand why you define both start_requests() and start_urls. If you don't override start_requests() and only set start_urls, the parent class's start_requests() generates requests for the URLs in the start_urls attribute. So one of them is redundant in your case. Also, the __init__ definition is wrong. It should be like this:
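(A sketch of the presumably intended signature, assuming the common Scrapy idiom of accepting and forwarding whatever arguments Scrapy passes when it creates the spider:)

def __init__(self, *args, **kwargs):
    # Forward the spider name and any -a command-line arguments to the base
    # class instead of swallowing them with a zero-argument signature.
    super().__init__(*args, **kwargs)
    self.allowed_domains = ['yenibiris.com']
    self.start_urls = ['https://www.yenibiris.com/is-ilanlari?q=yazilim']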
A LinkExtractor extracts links from a response only when that response is handled by CrawlSpider's own built-in callback; because your start_requests sends every response to parse_page instead, the rules never get a chance to run. The regex .*&sayfa=\d+ in the LinkExtractor is appropriate for the webpage, so it should work as expected once you fix the mistakes in your code.
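As a quick sanity check of that claim, the pattern can be tested against the pagination URL from the question outside of Scrapy; LinkExtractor matches its allow= patterns with re.search, so a match anywhere in the URL is enough:

import re

# Allow pattern from the Rule and the pagination URL quoted in the question.
pattern = r'.*&sayfa=\d+'
url = 'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2'

print(bool(re.search(pattern, url)))  # True, so the pattern itself is fine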