Make scrapy follow links in order

I wrote a script and used Scrapy to find links in the first phase and follow the links and extract something from the page in the second phase. Scrapy DOES it BUT it follows the links in an unordered manner, i.e. I expect an output as below:

link1 | data_extracted_from_link1_destination_page
link2 | data_extracted_from_link2_destination_page
link3 | data_extracted_from_link3_destination_page
.
.
.

but I get

link1 | data_extracted_from_link2_destination_page
link2 | data_extracted_from_link3_destination_page
link3 | data_extracted_from_link1_destination_page
.
.
.

here is my code:

import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'

        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())

            yield {"LinkText": LinkText}
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents)

    def parse_contents(self, response):

        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        yield {"LinkContent": sContent}

What is wrong in my code?

标签： python web-scraping scrapy

1条回答

该账号已被封号

2楼-- · 2019-09-19 08:02

yield is not synchronous, you should use meta to achieve this. Doc: https://doc.scrapy.org/en/latest/topics/request-response.html
Code:

import scrapy
class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']
    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'
        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = 
              response.urljoin(link.xpath(LinkDestSelector).extract_first())
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents, meta={"LinkText": LinkText})

    def parse_contents(self, response):
        lContent = 
response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        linkText = response.meta['LinkText']
        yield {"LinkContent": sContent,"LinkText": linkText}

0人赞添加讨论(0) 举报

Make scrapy follow links in order

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间