I wrote a Scrapy spider class to extract a piece of content from a page, like so:
#!/usr/bin/python
import html2text
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class StockSpider(BaseSpider):
    name = "stock_spider"
    allowed_domains = ["www.hamshahrionline.ir"]
    start_urls = ["http://www.hamshahrionline.ir/details/261730/Health/publichealth"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # sample = hxs.select("WhatShouldIputHere").extract()[AndHere]
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print converter.handle(sample)
My main problem is the statement that I commented out.
How do I choose the XPath expression and the index into the extract() result?
Can you guide me through this and give me some examples?
Thank you
First you need to decide what data you want to get out of the page, then define an Item class and a set of Fields. Then, in order to fill the item fields with data, you need to use xpath expressions in the parse() method of your spider.

Here's an example that retrieves all of the paragraphs out of the body (all news, I suppose):
Note that I'm using the Selector class since HtmlXPathSelector is deprecated. Also, I'm using the xpath() method instead of select() for the same reason.

Also, note that you'd better move your Item definition into a separate Python script to follow the Scrapy project structure.

Hope that helps.