Not able to follow link using Scrapy

I am not able to follow a link and get the values back from it.

With the code below I can crawl the first link, but Scrapy never goes on to request the second link, so the follow-up callback (a_1) is never called.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request


class ScrapyOrgSpider(BaseSpider):
    name = "scrapy"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/abcd"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        res1 = Request("http://www.example.com/follow", self.a_1)
        print res1

    def a_1(self, response1):
        hxs2 = HtmlXPathSelector(response1)
        print hxs2.select("//a[@class='channel-link']").extract()[0]
        return response1

Tags: python scrapy

2 Answers

叛逆

#2 · 2019-09-14 02:01

You forgot to return your Request in the parse() method. Try this code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request


class ScrapyOrgSpider(BaseSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/abcd"]

    def parse(self, response):
        self.log('@@ Original response: %s' % response)
        req = Request("http://www.example.com/follow", callback=self.a_1)
        self.log('@@ Next request: %s' % req)
        return req

    def a_1(self, response):
        hxs = HtmlXPathSelector(response)
        self.log('@@ extraction: %s' %
            hxs.select("//a[@class='channel-link']").extract())

Log output:

2012-11-22 12:20:06-0600 [scrapy] INFO: Scrapy 0.17.0 started (bot: oneoff)
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Enabled item pipelines:
2012-11-22 12:20:06-0600 [example.com] INFO: Spider opened
2012-11-22 12:20:06-0600 [example.com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-11-22 12:20:07-0600 [example.com] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example/> from <GET http://www.example.com/abcd>
2012-11-22 12:20:07-0600 [example.com] DEBUG: Crawled (200) <GET http://www.iana.org/domains/example/> (referer: None)
2012-11-22 12:20:07-0600 [example.com] DEBUG: @@ Original response: <200 http://www.iana.org/domains/example/>
2012-11-22 12:20:07-0600 [example.com] DEBUG: @@ Next request: <GET http://www.example.com/follow>
2012-11-22 12:20:07-0600 [example.com] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example/> from <GET http://www.example.com/follow>
2012-11-22 12:20:08-0600 [example.com] DEBUG: Crawled (200) <GET http://www.iana.org/domains/example/> (referer: http://www.iana.org/domains/example/)
2012-11-22 12:20:08-0600 [example.com] DEBUG: @@ extraction: []
2012-11-22 12:20:08-0600 [example.com] INFO: Closing spider (finished)


再贱就再见

#3 · 2019-09-14 02:09

The parse function must return the request, not just print it.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    res1 = Request("http://www.example.com/follow", callback=self.a_1)
    print res1  # if you want
    return res1
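
To see why the return matters, here is a toy model in plain Python. This is not Scrapy's actual engine, only a sketch of the idea that the framework can schedule nothing but what a callback hands back. The Request class and the crawl() function below are hypothetical stand-ins written for this illustration.

```python
from collections import deque


class Request:
    """Hypothetical stand-in for Scrapy's Request: a URL plus a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def crawl(start_url, start_callback):
    """Toy engine loop: only Requests returned by a callback get queued."""
    queue = deque([Request(start_url, start_callback)])
    visited = []
    while queue:
        request = queue.popleft()
        visited.append(request.url)
        # The URL plays the role of the downloaded response here.
        result = request.callback(request.url)
        if isinstance(result, Request):
            queue.append(result)
    return visited


def a_1(response):
    return None  # nothing further to follow


def parse_broken(response):
    res1 = Request("http://www.example.com/follow", a_1)
    print(res1.url)  # built and printed, but never handed back


def parse_fixed(response):
    return Request("http://www.example.com/follow", a_1)


print(crawl("http://www.example.com/abcd", parse_broken))
print(crawl("http://www.example.com/abcd", parse_fixed))
```

With parse_broken only the start URL ends up in the visited list; with parse_fixed the follow URL is visited as well, which mirrors what happens in the spider once parse() returns the Request.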

