Using scrapy to find specific text from multiple websites

Published 2019-05-28 09:21

Question:

I would like to crawl/check multiple websites (on the same domain) for a specific keyword. I have found this script, but I can't figure out how to add the specific keyword to be searched for. The script needs to find the keyword and report which link it was found in. Could anyone point me to where I could read more about this? I have been reading Scrapy's documentation, but I can't seem to find it there.

Thank you.

import scrapy
from scrapy import Request

# URL, starting_number and number_of_pages are assumed to be defined elsewhere.

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self):
        super().__init__()
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # parsing data from the webpage
        pass

Answer 1:

You'll need to use some parser or regex to find the text you are looking for inside the response body.

Every Scrapy callback receives the response body inside the response object, which you can inspect with response.body (for example inside the parse method). Then use a regex, or better, XPath or CSS selectors, to navigate to your text, given the structure of the page you crawled.
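As a minimal sketch of the regex approach (the helper name and the keyword are my own, not from the question), you can factor the check into a plain function and call it from any callback:

```python
import re

def page_contains_keyword(body_text: str, keyword: str) -> bool:
    """Return True if keyword occurs anywhere in the page body (case-insensitive)."""
    return re.search(re.escape(keyword), body_text, re.IGNORECASE) is not None

# Inside a spider callback you could then write (sketch):
#
#     def parse(self, response):
#         if page_contains_keyword(response.text, "my keyword"):
#             yield {"url": response.url, "keyword": "my keyword"}
```

Yielding the URL alongside the match is one way to get the "which link it was found in" output the question asks for; Scrapy collects the yielded items when you run the crawl with `-o results.json` or similar.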

Scrapy lets you use the response object as a Selector, so you can get the title of the page with response.xpath('//head/title/text()'), for example.

Hope it helped.