How to use scrapy to crawl multiple pages? (two le

Posted 2019-09-02 01:37

On my site I created two simple pages. Here is their HTML:

test1.html :

<head>
<title>test1</title>
</head>
<body>
<a href="test2.html" onclick="javascript:return xt_click(this, 'C', '1', 'Product', 'N');" indepth="true">
<span>cool</span></a>
</body></html>

test2.html :

<head>
<title>test2</title>
</head>
<body></body></html>

I want to scrape the text in the title tag of both pages, here "test1" and "test2". I am new to Scrapy, though, and I only manage to scrape the first page. My Scrapy script:

from scrapy.spider import Spider
from scrapy.selector import Selector

from testscrapy1.items import Website

class DmozSpider(Spider):
    name = "bill"
    allowed_domains = ["http://exemple.com"]
    start_urls = [
        "http://www.exemple.com/test1.html"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//head')
        items = []

        for site in sites:
            item = Website()
            item['title'] = site.xpath('//title/text()').extract()
            items.append(item)

        return items

How do I get past the onclick? And how can I scrape the text of the title tag on the second page? Thank you in advance. STEF

Tags: scrapy
1 answer
甜甜的少女心
#2 · 2019-09-02 02:18

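To use several parsing functions in your spider, that is, to send multiple requests and parse each response separately, you need two things: 1) yield instead of return, so that parse can emit both items and further requests, and 2) a callback on each new request.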

Example:

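import scrapy  # needed for scrapy.Request

def parse(self, response):
    # Yield one item per <head> element found on the first page.
    for site in response.xpath('//head'):
        item = Website()
        # './/' keeps the query relative to the current node.
        item['title'] = site.xpath('.//title/text()').extract()
        yield item
    # Schedule a second request; its response is passed to other_function.
    yield scrapy.Request(url="http://www.domain.com", callback=self.other_function)

def other_function(self, response):
    for other_thing in response.xpath('//this_xpath'):
        item = Website()
        item['title'] = other_thing.xpath('.//this/and/that').extract()
        yield item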

Scrapy does not execute JavaScript, but you can work out what the JavaScript does and reproduce it yourself: http://doc.scrapy.org/en/latest/topics/firebug.html
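
For the page in the question there may be nothing to reproduce: the link's href already names the target (test2.html), so the onclick handler does not have to run in order to find the second page. What remains is turning the relative href into an absolute URL. A minimal sketch of that step with the standard library only, where the function name resolve_link is hypothetical:

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


next_url = resolve_link("http://www.exemple.com/test1.html", "test2.html")
print(next_url)  # http://www.exemple.com/test2.html
```

Inside a spider, response.urljoin(href) performs the same resolution using the URL of the response.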
