from scrapy.spider import BaseSpider


class dmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
Then I ran "scrapy crawl dmoz" and got this error:
2013-09-14 13:20:56+0700 [dmoz] DEBUG: Retrying <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (failed 1 times): Connection to other side was lost in a non-clean fashion.
Does anyone know how to fix this?
Check your internet connection first. If you are behind a proxy, set the proxy environment variables, including any authentication credentials, so that Scrapy can reach the site.
In Windows, try these steps:
Alternative way: setting-proxy-env
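
As a rough sketch of the proxy setup described above, the variables can also be set from Python before the crawl starts (for example, at the top of the project's settings.py). This assumes Scrapy's default proxy middleware is enabled and reads the standard http_proxy / https_proxy variables; the helper names, host, port, and credentials below are hypothetical placeholders, not values from the question.

```python
import os
from urllib.parse import quote


def build_proxy_url(host, port, user=None, password=None):
    """Return an http:// proxy URL, percent-encoding any credentials."""
    if user is None:
        return "http://{}:{}".format(host, port)
    credentials = quote(user, safe="")
    if password is not None:
        credentials += ":" + quote(password, safe="")
    return "http://{}@{}:{}".format(credentials, host, port)


def set_proxy_env(proxy_url, environ=os.environ):
    """Store the proxy URL under the conventional proxy variable names."""
    for name in ("http_proxy", "https_proxy"):
        environ[name] = proxy_url


# Hypothetical values; replace them with your own proxy details.
proxy_url = build_proxy_url("proxy.example.com", 8080, "alice", "p@ss:word")
set_proxy_env(proxy_url)
```

Percent-encoding the credentials matters when the password contains characters such as "@" or ":", which would otherwise be misread as part of the URL structure. If the crawl still fails with the variables set, the connection itself is the more likely cause.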