This question already has an answer here:
- Scrapy: scraping data from Pagination 2 answers
I am trying to scrape data from a page and continue scraping following the pagination link.
The page I am trying to scrape is --> here
# -*- coding: utf-8 -*-
import scrapy
class AlibabaSpider(scrapy.Spider):
name = 'alibaba'
allowed_domains = ['alibaba.com']
start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']
def parse(self, response):
for products in response.xpath('//div[contains(@class, "m-gallery-product-item-wrap")]'):
item = {
'product_name': products.xpath('.//h2/a/@title').extract_first(),
'price': products.xpath('.//div[@class="price"]/b/text()').extract_first('').strip(),
'min_order': products.xpath('.//div[@class="min-order"]/b/text()').extract_first(),
'company_name': products.xpath('.//div[@class="stitle util-ellipsis"]/a/@title').extract_first(),
'prod_detail_link': products.xpath('.//div[@class="item-img-inner"]/a/@href').extract_first(),
'response_rate': products.xpath('.//i[@class="ui2-icon ui2-icon-skip"]/text()').extract_first('').strip(),
#'image_url': products.xpath('.//div[@class=""]/').extract_first(),
}
yield item
#Follow the paginatin link
next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
if next_page_url:
yield scrapy.Request(url=next_page_url, callback=self.parse)
Problem
- The code is not able to follow the pagination link.
How can you help
- Modify the code to follow the pagination link.
It doesn't work because url isn't valid. If you want to keep using
scrapy.Request
, you could use:A shorter solution:
To get your code working, you need to fix the broken link by using
response.follow()
or something similar. Try the below approach.Your pasted code was badly indented. I've fixed that as well.