How to extract exact tags in scrapy

Posted 2019-07-28 17:37

I wrote a Scrapy spider class to extract a piece of content from a page, like so:

#!/usr/bin/python
import html2text
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class StockSpider(BaseSpider):
    name = "stock_spider"
    allowed_domains = ["www.hamshahrionline.ir"]
    start_urls = ["http://www.hamshahrionline.ir/details/261730/Health/publichealth"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
#       sample = hxs.select("WhatShouldIputHere").extract()[AndHere]
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print converter.handle(sample)

My main problem is the statement that I commented out.

How do I choose the XPath expression to select, and what index should I use on the result of extract()?

Can you guide me through this and give me some examples?

Thank you

1 answer
forever°为你锁心
Reply #2 · 2019-07-28 18:03

First, decide what data you want to get out of the page, then define an Item class with a set of Fields. To fill the item's fields with data, use XPath expressions in the parse() method of your spider.

Here's an example that retrieves all of the paragraphs out of the body (all news, I suppose):

from scrapy.item import Item, Field
from scrapy.spiders import Spider  # scrapy.spider was relocated to scrapy.spiders in Scrapy 1.0
from scrapy.selector import Selector


class MyItem(Item):
    content = Field()


class StockSpider(Spider):
    name = "stock_spider"
    allowed_domains = ["www.hamshahrionline.ir"]
    start_urls = ["http://www.hamshahrionline.ir/details/261730/Health/publichealth"]

    def parse(self, response):
        sel = Selector(response)
        paragraphs = sel.xpath("//div[@class='newsBodyCont']/p/text()").extract()
        for p in paragraphs:
            item = MyItem()
            item['content'] = p
            yield item

Note that I'm using the Selector class because HtmlXPathSelector is deprecated. For the same reason, I'm using the xpath() method instead of select().

Also, it is better to move the Item definition into a separate Python module (conventionally items.py) to follow the standard Scrapy project structure.

Hope that helps.
