Scrapy: ValueError('Missing scheme in request

I am trying to scrape data from a webpage. The webpage is simply a bulleted list of 2500 URLs. Scrapy should fetch each of those URLs and extract some data from it ...

Here is my code

from bs4 import BeautifulSoup
from scrapy import Request, Selector
from scrapy.spiders import CrawlSpider
# NewsFields is the project's own Item class; its import is not shown here.

class MySpider(CrawlSpider):
    name = 'dknews'
    start_urls = ['http://www.example.org/uat-area/scrapy/all-news-listing']
    allowed_domains = ['example.org']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        # Elements carrying the page metadata, matched by their name attribute
        ptype = soup.find_all(attrs={"name":"dkpagetype"})
        ptitle = soup.find_all(attrs={"name":"dkpagetitle"})
        pturl = soup.find_all(attrs={"name":"dkpageurl"})
        ptdate = soup.find_all(attrs={"name":"dkpagedate"})
        ptdesc = soup.find_all(attrs={"name":"dkpagedescription"})
        for node in soup.find_all("div", class_="module_content-panel-sidebar-content"):
            # Join all text nodes, then collapse runs of whitespace
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
        yield nf
        # Follow every link in the bullet list
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(url, callback=self.parse)

Now the problem is that the above code scrapes around 215 out of 2500 articles. It closes by giving this error ...

ValueError('Missing scheme in request url: %s' % self._url)

I have no idea what is causing this error ....

Any help is much appreciated.

Thanks

Tags: python scrapy

1 Answer

乱世女痞

#2 · 2019-06-02 11:58

Update 01/2019

Nowadays Scrapy's Response instance has a convenient method, response.follow, which builds a Request from the given URL (absolute, relative, or even a Link object produced by LinkExtractor), using response.url as the base:

yield response.follow('some/url', callback=self.parse_some_url, headers=headers, ...)

Docs: http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.follow

The code below looks like the source of the problem:

    for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
        yield Request(url, callback=self.parse)

If any of the URLs is not fully qualified, e.g. it looks like href="/path/to/page" rather than href="http://example.com/path/to/page", you'll get this error. To make sure you yield valid requests, you can use urljoin:

    yield Request(response.urljoin(url), callback=self.parse)
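To see why the join helps, here is a small standard-library sketch. It assumes response.urljoin behaves like urllib.parse.urljoin with response.url as the base (for HTML responses Scrapy may also take a <base> tag into account). The absolutize helper is a hypothetical name used only for illustration:

```python
from urllib.parse import urljoin, urlparse

# The listing page from the question serves as the base URL.
base = "http://www.example.org/uat-area/scrapy/all-news-listing"

def absolutize(href, base_url=base):
    """Resolve href against base_url (hypothetical helper for illustration)."""
    return urljoin(base_url, href)

relative = "/path/to/page"

# A relative href has no scheme, which is what Request complains about.
has_scheme_before = bool(urlparse(relative).scheme)

# After joining, the scheme and host are inherited from the base URL.
resolved = absolutize(relative)

# An already absolute URL passes through unchanged.
already_absolute = absolutize("http://example.com/path/to/page")
```

Joining is safe to apply to every extracted href, because URLs that are already fully qualified are left as they are.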

The idiomatic Scrapy way, though, is to use LinkExtractor: https://doc.scrapy.org/en/latest/topics/link-extractors.html
