Recursive Scraping on Craigslist with Scrapy

Posted 2019-05-23 04:05

I have been trying to hone my Python skills by building scrapers, and I recently switched from bs4 to Scrapy so that I can use its concurrent requests and download delay features. I have been able to build a basic scraper and write the data to CSV, but when I try to add a recursive feature I run into problems. I tried following the advice from "Scrapy Recursive download of Content" but keep getting the following error:

DEBUG: Retrying http://medford.craigslist.org%20%5Bu'/cto/4359874426.html'%5D> DNS lookup failed: address not found

This makes me think the way I am joining the links isn't working, since it's inserting extra characters into the URL, but I can't figure out how to fix it. Any advice?

Here's the code:

#-------------------------------------------------------------------------------
# Name:        module1
# Purpose:
#
# Author:      CD
#
# Created:     02/03/2014
# Copyright:   (c) CD 2014
# Licence:     <your licence>
#-------------------------------------------------------------------------------
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import *

class PageSpider(BaseSpider):
    name = "cto"

    start_urls = ["http://medford.craigslist.org/cto/"]

    rules = (Rule(SgmlLinkExtractor(allow=("index\d00\.html", ), restrict_xpaths=('//p[@class="nextpage"]' ,))
        , callback="parse", follow=True), )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        for title in titles:
            item = CraigslistSampleItem()
            item['title'] = title.select("a/text()").extract()
            item['link'] = title.select("a/@href").extract()

            url = "http://medford.craiglist.org %s" % item['link']
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item

1 Answer
劫难
#2 · 2019-05-23 04:50

Turns out your code:

 url = "http://medford.craiglist.org %s" % item['link']

generates:

http://medford.craigslist.org [u'/cto/4359874426.html']
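A minimal sketch of why this happens, written in plain Python 3 without Scrapy (so the list repr lacks the Python 2 `u` prefix shown above). The `links` variable is a hypothetical stand-in for what `extract()` returns, and the `quote` call is only an assumption about how the space and brackets end up percent-encoded in the log:

```python
from urllib.parse import quote

# Stand-in for what .extract() returns: a list of matches, not a string.
links = ['/cto/4359874426.html']

# %-formatting a list inserts its repr, brackets and quotes included.
bad_url = "http://medford.craigslist.org %s" % links
print(bad_url)

# Percent-encoding the space and brackets gives the shape seen in the log:
# " " becomes %20, "[" becomes %5B, "]" becomes %5D.
encoded = quote(bad_url, safe=":/'")
print(encoded)
```

The encoded string is treated as one long hostname, which is why the failure shows up as a DNS lookup error and not as a 404.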

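item['link'] holds a list, not the string you are expecting, because extract() returns a list of every match. Join the list into a string before formatting it into the URL, and drop the space before the placeholder. Note also that the hostname in that line is misspelled as craiglist; it should be craigslist, as in your start_urls:

url = 'http://medford.craigslist.org{}'.format(''.join(item['link']))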
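An alternative to joining the list is to take the first extracted href and resolve it against the base URL with the standard library's `urljoin`. This is only a sketch under assumptions: `build_url` and `BASE_URL` are hypothetical names introduced for illustration, and the input list stands in for the result of `extract()`:

```python
from urllib.parse import urljoin

# Assumed base URL, taken from the spider's start_urls.
BASE_URL = "http://medford.craigslist.org"


def build_url(extracted_links, base=BASE_URL):
    """Return an absolute URL for the first extracted href, or None if empty."""
    if not extracted_links:
        # The XPath matched nothing, so there is no page to request.
        return None
    # urljoin resolves a relative path such as '/cto/123.html' against base.
    return urljoin(base, extracted_links[0])


url = build_url(['/cto/4359874426.html'])
print(url)
```

Taking a single element makes the intent explicit and avoids gluing several hrefs together if the XPath ever matches more than one link.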