This is a continuation of the question: Extract from dynamic JSON response with Scrapy
I have a Scrapy spider that extracts values from a JSON response. It works well and extracts the right values, but somehow it enters a loop and returns more results than expected (duplicate results).
For example, for 17 values provided in the test.txt file it returns 289 results, that is, 17 times more than expected.
Spider content below:
import scrapy
import json
from whois.items import WhoisItem

class whoislistSpider(scrapy.Spider):
    name = "whois_list"
    start_urls = []

    f = open('test.txt', 'r')
    global lines
    lines = f.read().splitlines()
    f.close()

    def __init__(self):
        for line in lines:
            self.start_urls.append('http://www.example.com/api/domain/check/%s/com' % line)

    def parse(self, response):
        for line in lines:
            jsonresponse = json.loads(response.body_as_unicode())
            item = WhoisItem()
            domain_name = list(jsonresponse['domains'].keys())[0]
            item["avail"] = jsonresponse["domains"][domain_name]["avail"]
            item["domain"] = domain_name
            yield item
items.py content below:

import scrapy

class WhoisItem(scrapy.Item):
    avail = scrapy.Field()
    domain = scrapy.Field()
pipelines.py below:

class WhoisPipeline(object):
    def process_item(self, item, spider):
        return item
Thank you in advance for all the replies.
The parse function should simply parse the single response it receives. Notice that I removed the for loop. What was happening: for every single response you would loop over all 17 lines and parse the same response 17 times, therefore resulting in 17 × 17 = 289 records.
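The fix can be sketched as a standalone function, without the Scrapy plumbing. This mirrors the corrected parse logic on a plain response body; the sample JSON below is a hypothetical example of the API's response shape, not a real response:

```python
import json

# Hypothetical sample body in the shape the question's code expects:
# {"domains": {"<name>": {"avail": ...}}}
SAMPLE_BODY = '{"domains": {"example.com": {"avail": 1}}}'

def parse_domain_check(body):
    """Turn ONE response body into ONE item dict -- no loop over lines.

    Each response already corresponds to exactly one start URL, so there
    is nothing to iterate over here. The old loop re-parsed the same
    response once per input line, yielding len(lines) duplicates.
    """
    jsonresponse = json.loads(body)
    domain_name = list(jsonresponse["domains"].keys())[0]
    return {
        "domain": domain_name,
        "avail": jsonresponse["domains"][domain_name]["avail"],
    }

# 17 responses produce 17 items, not 289
items = [parse_domain_check(SAMPLE_BODY) for _ in range(17)]
```

In the spider itself, the same change means the `parse` method body keeps only the `json.loads`, item construction, and `yield`, with the `for line in lines:` wrapper deleted.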