How to extract exact tags in scrapy

I wrote a Scrapy spider class to get a piece of content from a page, like so:

#!/usr/bin/python
import html2text
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class StockSpider(BaseSpider):
    name = "stock_spider"
    allowed_domains = ["www.hamshahrionline.ir"]
    start_urls = ["http://www.hamshahrionline.ir/details/261730/Health/publichealth"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
#       sample = hxs.select("WhatShouldIputHere").extract()[AndHere]
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print converter.handle(sample)

My main problem is the statement that I commented out.

How do I choose the XPath expression and the index to use on the result of extract()?

Could you guide me through this and give me some examples?

Thank you

Tags: python html web-scraping scrapy extract

1 answer

forever°为你锁心

#2 · 2019-07-28 18:03

First, decide what data you want to get out of the page, then define an Item class with a set of Fields. To fill the item fields with data, use XPath expressions in the parse() method of your spider.

Here's an example that retrieves all of the paragraphs from the news body (the whole article text, I suppose):

from scrapy import Field, Item, Spider
from scrapy.selector import Selector


class MyItem(Item):
    content = Field()


class StockSpider(Spider):
    name = "stock_spider"
    allowed_domains = ["www.hamshahrionline.ir"]
    start_urls = ["http://www.hamshahrionline.ir/details/261730/Health/publichealth"]

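    def parse(self, response):
        sel = Selector(response)
        # Text of every <p> that is a direct child of the news body div
        paragraphs = sel.xpath("//div[@class='newsBodyCont']/p/text()").extract()
        # extract() returns a list of strings; emit one item per paragraph
        for p in paragraphs:
            item = MyItem()
            item['content'] = p
            yield item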

Note that I'm using the Selector class, since HtmlXPathSelector is deprecated. For the same reason, I'm using the xpath() method instead of select().

Also, note that it's better to move the Item definition into a separate Python module (conventionally items.py) to follow the Scrapy project structure.

Hope that helps.
