I'm hoping someone can point me in the right direction with using Scrapy in Python.

I've been trying to follow the example for several days and still can't get the expected output. I used the Scrapy tutorial, and even downloaded the exact project from the GitHub repo, but the output I get does not match what the tutorial describes.

from scrapy.spiders import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = [""]
    start_urls = [
    ]

    def parse(self, response):
        """
        The lines below are a spider contract. For more info see:

        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re(r'-\s[^\n]*\r')
            # Collect each item so that the returned list is not empty.
            items.append(item)

        return items

After downloading the project from GitHub, I ran "scrapy crawl dmoz" in the top-level directory and got the following output:

2016-08-31 00:08:19 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot)
2016-08-31 00:08:19 [scrapy] INFO: Overridden settings: {'DEFAULT_ITEM_CLASS': 'dirbot.items.Website', 'NEWSPIDER_MODULE': 'dirbot.spiders', 'SPIDER_MODULES': ['dirbot.spiders']}
2016-08-31 00:08:19 [scrapy] INFO: Enabled extensions:
2016-08-31 00:08:19 [scrapy] INFO: Enabled downloader middlewares:
2016-08-31 00:08:19 [scrapy] INFO: Enabled spider middlewares:
2016-08-31 00:08:19 [scrapy] INFO: Enabled item pipelines:
2016-08-31 00:08:19 [scrapy] INFO: Spider opened
2016-08-31 00:08:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-31 00:08:19 [scrapy] DEBUG: Telnet console listening on
2016-08-31 00:08:20 [scrapy] DEBUG: Crawled (200) <GET> (referer: None)
2016-08-31 00:08:20 [scrapy] DEBUG: Crawled (200) <GET> (referer: None)
2016-08-31 00:08:20 [scrapy] INFO: Closing spider (finished)
2016-08-31 00:08:20 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 514,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 16179,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 8, 31, 7, 8, 20, 314625),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 8, 31, 7, 8, 19, 882944)}
2016-08-31 00:08:20 [scrapy] INFO: Spider closed (finished)

I was expecting this, per the tutorial:

[scrapy] DEBUG: Scraped from <200>
 {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n'],
  'link': [u''],
  'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200>
 {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
  'link': [u''],
  'title': [u'XML Processing with Python']}


It seems the spider in the tutorial is outdated. The website has changed a bit, so none of the XPaths match anything anymore. This is easy to fix:

def parse(self, response):
    sites = response.xpath('//div[@class="title-and-desc"]/a')
    for site in sites:
        item = dict()
        item['name'] = site.xpath("text()").extract_first() 
        item['url'] = site.xpath("@href").extract_first() 
        item['description'] = site.xpath("following-sibling::div/text()").extract_first('').strip()
        yield item

For future reference, you can always test whether a specific XPath works with the scrapy shell command.
For example, this is what I did to test it:

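$ scrapy shell ""
# test sites xpath
>>> response.xpath('//ul[@class="directory-url"]/li')
# ok it doesn't work, check out page in web browser
# find correct xpath and test that:
>>> response.xpath('//div[@class="title-and-desc"]/a')
# 21 result nodes printed
# it works!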
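
The same kind of check can be rehearsed offline. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (ElementTree supports just a small subset of XPath), and SAMPLE_HTML is made-up markup that assumes the layout implied by the XPaths above; it is not a copy of the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup imitating the layout the new XPath targets.
SAMPLE_HTML = """
<html><body>
  <div class="title-and-desc">
    <a href="http://example.com/book"><div class="site-title">Example Book</div></a>
    <div class="site-descr ">An example description.</div>
  </div>
</body></html>
""".strip()


def count_matches(markup, path):
    """Return how many nodes the given ElementTree path selects."""
    root = ET.fromstring(markup)
    return len(root.findall(path))


# The old tutorial path finds nothing in this layout; the new one finds the link.
old_count = count_matches(SAMPLE_HTML, ".//ul[@class='directory-url']/li")
new_count = count_matches(SAMPLE_HTML, ".//div[@class='title-and-desc']/a")
print(old_count, new_count)
```

A count of zero for a path is the offline equivalent of the empty result in the shell: the selector no longer matches the page structure.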


Here is a corrected version of the Scrapy spider that extracts details from DMOZ:

import scrapy


class MozSpider(scrapy.Spider):
    name = "moz"
    allowed_domains = [""]
    start_urls = [
        '',
    ]

    def parse(self, response):
        sites = response.xpath('//div[@class="title-and-desc"]')
        for site in sites:
            name = site.xpath('a/div[@class="site-title"]/text()').extract_first()
            url = site.xpath('a/@href').extract_first()
            # Default to '' so .strip() does not fail when no description is found.
            description = site.xpath('div[@class="site-descr "]/text()').extract_first('').strip()

            yield {'Name': name, 'URL': url, 'Description': description}

To export it into CSV, open the spider folder in your Terminal/CMD and type:

scrapy crawl moz -o result.csv
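
For a rough idea of what ends up in result.csv, the sketch below writes dicts shaped like the ones the spider yields, using the standard library's csv module. It only illustrates the file layout and is not Scrapy's feed exporter; the sample row and the rows_to_csv helper are invented for this example.

```python
import csv
import io

# Invented row with the same keys the spider above yields.
rows = [
    {'Name': 'Example Site', 'URL': 'http://example.com', 'Description': 'A sample entry'},
]


def rows_to_csv(rows, fieldnames):
    """Serialise a list of dicts to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ['Name', 'URL', 'Description'])
print(csv_text)
```

Each yielded dict becomes one row, and the dict keys become the column headers.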

Here is another basic Scrapy example, which extracts company details from YellowPages:

import scrapy

class YlpSpider(scrapy.Spider):
    name = "ylp"
    allowed_domains = [""]
    start_urls = ['']

    def parse(self, response):
        companies = response.xpath('//*[@class="info"]')

        for company in companies:
            name = company.xpath('h3/a/span[@itemprop="name"]/text()').extract_first()
            phone = company.xpath('div/div[@class="phones phone primary"]/text()').extract_first()
            website = company.xpath('div/div[@class="links"]/a/@href').extract_first()

            yield{'Name':name,'Phone':phone, 'Website':website}

To export it into CSV, open the spider folder in your Terminal/CMD and type:

scrapy crawl ylp -o result.csv

This Scrapy spider extracts company details from Yelp:

import scrapy

class YelpSpider(scrapy.Spider):
    name = "yelp"
    allowed_domains = [""]
    start_urls = [',+CO']

    def parse(self, response):
        companies = response.xpath('//*[@class="biz-listing-large"]')

        for company in companies:
            name = company.xpath('.//span[@class="indexed-biz-name"]/a/span/text()').extract_first()
            address1 = company.xpath('.//address/text()').extract_first('').strip()
            address2 = company.xpath('.//address/text()[2]').extract_first('').strip()  # '' is the default returned when nothing matches, so we never get None.
            address = address1 + " - " + address2
            phone = company.xpath('.//*[@class="biz-phone"]/text()').extract_first('').strip()
            website = "" + company.xpath('.//@href').extract_first()

            yield{'Name':name, 'Address':address, 'Phone':phone, 'Website':website}
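
One side effect of the string concatenation above is that a listing with only one address line ends up with a dangling " - ". A possible refinement, shown here as a standalone sketch with a hypothetical helper named join_parts and invented sample addresses, is to join only the non-empty pieces:

```python
def join_parts(parts, separator=" - "):
    """Join the non-empty, stripped strings in parts; skip None and blanks."""
    cleaned = [part.strip() for part in parts if part and part.strip()]
    return separator.join(cleaned)


# Invented sample values standing in for address1 and address2.
full = join_parts(["123 Example St ", " Denver, CO 80202"])
partial = join_parts(["123 Example St", ""])
missing = join_parts([None, ""])
print(repr(full), repr(partial), repr(missing))
```

Inside parse, this would replace the concatenation line with address = join_parts([address1, address2]).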

To export it into CSV, open the spider folder in your Terminal/CMD and type:

scrapy crawl yelp -o result.csv

